What is The Nation’s Report Card?



THE NATION'S REPORT CARD, the National Assessment of Educational Progress (NAEP), is the only nationally representative and continuing assessment 
of what America’s students know and can do in various subject areas. Since 1969, assessments have been conducted periodically in reading, mathematics, 
science, writing, history/geography, and other fields. By making objective information on student performance available to policymakers at the national,
state, and local levels, NAEP is an integral part of our nation’s evaluation of the condition and progress of education. Only information related to academic 
achievement is collected under this program. NAEP guarantees the privacy of individual students and their families. 

NAEP is a congressionally mandated project of the National Center for Education Statistics, the U.S. Department of Education. The Commissioner of 
Education Statistics is responsible, by law, for carrying out the NAEP project through competitive awards to qualified organizations. NAEP reports directly 
to the Commissioner, who is also responsible for providing continuing reviews, including validation studies and solicitation of public comment, on NAEP’s 
conduct and usefulness. 

In 1988, Congress created the National Assessment Governing Board (NAGB) to formulate policy guidelines for NAEP. The board is responsible for
selecting the subject areas to be assessed, which may include adding to those specified by Congress; identifying appropriate achievement goals for each age
and grade; developing assessment objectives; developing test specifications; designing the assessment methodology; developing guidelines and standards
for data analysis and for reporting and disseminating results; developing standards and procedures for interstate, regional, and national comparisons; improving
the form and use of the National Assessment; and ensuring that all items selected for use in the National Assessment are free from racial, cultural, gender, 
or regional bias. 
FOREWORD



This technical report summarizes some of the most sophisticated statistical 
methodology used in any survey or testing program in the United States. In its 23- 
year history, the National Assessment of Educational Progress has employed such 
state-of-the-art techniques as matrix sampling and item response theory models. 
Today it is the only survey using the advanced plausible values methodology, which 
uses a multiple imputation procedure in a psychometric context. 

The 1992 Trial State Assessment of mathematics followed the same basic 
design as that used for the 1990 Trial State Assessment. Properties of the 1992 
assessment common to the 1990 assessment include: 1) continuing the use of
focused-BIB spiraling, item response theory models, and plausible values; 2) keeping 
the national and Trial State Assessment samples unduplicated; 3) doing separate 
stratifications and conditioning in each of the state samples; 4) making each state 
sample have power similar to the regional samples from the national assessment 
(this is how the sample sizes for the states were determined); 5) equating the 
aggregate of the state samples to the national scale (and doing this via a national 
subsample that also was representative of the aggregate of the states); 6) limiting the 
state samples to public schools; and 7) using power rules to determine which 
subgroup comparisons were supported by sufficient sample sizes (this became the 
"rule of 62"). 

There were several changes in the 1992 effort that should be noted. The
most obvious change was the inclusion of an assessment of fourth-grade public- 
school students in addition to the assessment of eighth-grade students (the only 
grade assessed in 1990). More items were added to the assessment (many of which 
were of the non-multiple-choice variety) resulting in a change from a 7-block
focused-BIB design in 1990 to a 13-block focused-BIB design in 1992. In addition, 
special booklets of items were administered to measure students’ estimation abilities. 
Another major change was that the National Assessment Governing Board 
established new within-grade Basic, Proficient, and Advanced achievement levels on
the NAEP scale. These were improvements over the 1990 effort, and, in fact,
represent the initial and primary way of reporting the 1992 results. Finally, there 
were some improvements in the conditioning process, which allowed more precise 
estimation of the correlation between content area scales. These changes 
necessitated a rescaling of the 1990 data (so that it is on the same scale as 1992) and 
a reanalysis of the 1990 results. For this reason, the reader of the 1992 reports may 
see some differences (generally slight) in the old reporting of the 1990 results 
compared to the new reporting of the 1990 results. 

For all the technical people working on the 1992 Trial State Assessment, the 
NAEP project has tested the limits of statistical theory and provided many 
opportunities to advance the state of the art. 
The NAEP project is not only characterized by its elegant statistical
procedures, but it is also noted for the dedicated professionalism of its staff. In 
hundreds of hours of technical advisory committees, I have not seen a single instance 
in which truth, honesty, and reason were compromised. It is the stubborn insistence 
that surveys are scientific activities and the relentless quest for improved 
methodology that have made NAEP credible for more than 23 years. 

Gary W. Phillips 

Associate Commissioner 

National Center for Education Statistics 







Chapter 1 
OVERVIEW: 



THE DESIGN, IMPLEMENTATION, AND ANALYSIS OF THE 
1992 TRIAL STATE ASSESSMENT PROGRAM IN MATHEMATICS 



Eugene G. Johnson, Stephen L. Koffler, and John Mazzeo
Educational Testing Service 



The National Assessment shall develop a trial mathematics assessment survey instrument 
for the 8th grade and shall conduct a demonstration of the instrument in 1990 in States
which wish to participate, with the purpose of determining whether such an assessment
yields valid, reliable State representative data. (Section 406 (i)(2)(C)(i) of the General
Education Provisions Act, as amended by Pub. L. 100-297 (20 U.S.C. 1221e-1(i)(2)(C)(i)))

The National Assessment shall conduct a trial mathematics assessment for the fourth and 
eighth grades in 1992 and, pursuant to subparagraph (6)(D), shall develop a trial 
reading assessment to be administered in 1992 for the fourth grade in States which wish 
to participate, with the purpose of determining whether such an assessment yields valid, 
reliable State representative data. (Section 406 (i)(2)(C)(ii) of the General Education
Provisions Act, as amended by Pub. L. 100-297 (20 U.S.C. 1221e-1(i)(2)(C)(ii)))



1.1 OVERVIEW 

In April 1988, Congress reauthorized the National Assessment of Educational Progress 
(NAEP) and added a new dimension to the program— voluntary state-by-state assessments on a 
trial basis in 1990 and 1992, in addition to continuing the national assessments that NAEP had 
conducted since its inception. In this report, we will refer to the voluntary state-by-state 
assessment program as the Trial State Assessment Program. These assessments, which are 
designed to provide state representative data, are distinct from the assessment designed to 
provide nationally representative data, referred to in this report as the national assessment. 
(This terminology is also used in all other reports of the 1990 and 1992 assessments.) It should 
be noted that the word trial in Trial State Assessment refers to the Congressionally mandated 
trial to determine whether such assessments can yield valid, reliable state representative data. 
All instruments and procedures used in the 1990 and 1992 Trial State and national assessments 
were previously piloted in field tests conducted in the year prior to the assessment. 
The 1990 Trial State Assessment Program collected information on the mathematics
knowledge, skills, understanding, and perceptions of a representative sample of eighth-grade
students in public schools in 37 states, the District of Columbia, and two territories. The second
phase of the Trial State Assessment Program, conducted in 1992, collected information on the
mathematics knowledge, skills, understanding, and perceptions of a representative sample of
fourth- and eighth-grade students and the reading knowledge, skills, understanding, and
perceptions of a representative sample of fourth-grade students in public schools in 41 states,
the District of Columbia, and two territories.¹

Table 1-1 lists the jurisdictions that participated in the 1992 Trial State Assessment 
Program. About 110,000 students at each grade level participated in the mathematics 
assessments in those jurisdictions. The students who were assessed in mathematics were
administered one of 26 mathematics assessment booklets also used in NAEP’s 1992 national
mathematics assessment. In addition, all students participating in the Trial State Assessment in 
mathematics were given a special set of questions measuring estimation skills that was also 
administered as part of the national program. The estimation block was administered using a 
special audiotape to pace the students through the items. 

Table 1-1

Jurisdictions Participating in the
1992 Trial State Assessment Program

Alabama                 Hawaii            Mississippi*      Pennsylvania
Arizona                 Idaho             Missouri*         Rhode Island
Arkansas                Indiana           Nebraska          South Carolina*
California              Iowa              New Hampshire     Tennessee*
Colorado                Kentucky          New Jersey        Texas
Connecticut             Louisiana         New Mexico        Utah*
Delaware                Maine*            New York          Virginia
District of Columbia    Maryland          North Carolina    Virgin Islands**
Florida                 Massachusetts*    North Dakota      West Virginia
Georgia                 Michigan          Ohio              Wisconsin
Guam                    Minnesota         Oklahoma          Wyoming

* These states did not participate in the 1990 Trial State Assessment Program. Illinois, Montana, and Oregon
participated in the 1990 program but did not participate in the 1992 program.

** The Virgin Islands participated in the 1992 Trial State Assessment Program. However, in accordance with the
legislation providing for participants to review and give permission for release of their results, the Virgin Islands chose
not to publish their results at grade 4.



¹ This report outlines the technical details of the 1992 Trial State Assessment in mathematics. A separate report on the
technical details of the 1992 Trial State Assessment in reading will be published at the same time as the results from the 
reading assessment. 
The mathematics framework and objectives established to guide both the Trial State
Assessment and national assessment were developed for NAEP through a consensus project of 
the Council of Chief State School Officers, funded by the National Center for Education 
Statistics and the National Science Foundation. The framework and objectives were also used 
for the 1990 and 1992 national mathematics assessments. In addition, questionnaires completed 
by the students, their mathematics teachers, and principals or other school administrators 
provided an abundance of contextual data within which to interpret the mathematics results. 

The purpose of this report is to provide technical information about the 1992 Trial State 
Assessment in mathematics. It provides a description of the design for the Trial State 
Assessment and gives an overview of the steps involved in the implementation of the program 
from the planning stages through to the analysis and reporting of the data. The report describes 
in detail the development of the cognitive and background questions, the field procedures, the 
creation of the database for analysis (from receipt of the assessment materials through scanning, 
scoring, and creation of the database), and the methods and procedures for sampling, analysis, 
and reporting. It does not provide the results of the assessment — rather, it provides 
information on how those results were derived. 

Educational Testing Service (ETS) was the contractor for the 1990 and 1992 NAEP 
programs, including the Trial State Assessment. ETS was responsible for overall management of 
the programs as well as for development of the overall design, the items and questionnaires, 
data analysis, and reporting. Westat, Inc., and National Computer Systems (NCS) were 
subcontractors to ETS. Westat was responsible for all aspects of sampling and of field 
operations, while NCS was responsible for printing, distribution, and receipt of all assessment 
materials, and for scanning and professional scoring. 

This technical report provides supporting material for the series of reports that have 
been prepared for the 1992 Trial State Assessment Program in mathematics, including: 

• A State Report for each participating jurisdiction that describes the mathematics
proficiency of the fourth- and eighth-grade public-school students in that jurisdiction
and relates their proficiency to contextual information about mathematics policies
and instruction.

• The NAEP 1992 Mathematics Report Card for the Nation and the States, which provides
data for all of the 44 jurisdictions that participated in the Trial State Assessment
Program as well as the results from the 1992 national mathematics assessment.

• The Executive Summary of the NAEP 1992 Mathematics Report Card for the Nation and
the States, providing the highlights of the Mathematics Report Card.

• The Data Compendium from the NAEP 1992 Mathematics Assessment for the Nation and
the States, which includes tables of data relating performance on the mathematics
assessment to a wide variety of demographic, perceptual, and experiential variables.

• Interpreting NAEP Scales, which describes past, present, and possible future methods 
of reporting and interpreting NAEP data. These include percent correct statistics, 
average percent correct, scale scores, scale anchoring, item mapping, and
achievement levels. 

• Two Almanacs for each jurisdiction, one for grade 4 and one for grade 8, that contain
a detailed breakdown of the mathematics proficiency data according to the responses
to the student, teacher, and school questionnaires for the population as a whole and
for important subgroups of the population. There are five sections to each almanac:

    - The Student Questionnaire Section provides a breakdown of the proficiency data
      according to the students’ responses to questions in the three student
      questionnaires included in the assessment booklets.

    - The Teacher Questionnaire Section provides a breakdown of the proficiency
      data according to the teachers’ responses to questions in the mathematics
      teacher questionnaires.²

    - The School Questionnaire Section provides a breakdown of the proficiency data
      according to the principals’ (or other administrators’) responses to questions
      in the school characteristics and policies questionnaire.

    - The Scale Section provides a breakdown of selected questions from the
      questionnaires according to each of the scales measuring areas of
      mathematics in the assessment.³

    - The Mathematics Item Section provides the response data for each mathematics
      item in the assessment.



ORGANIZATION OF THE TECHNICAL REPORT 

This chapter provides a description of the design for the Trial State Assessment in 
mathematics and gives an overview of the steps involved in implementing the program from the 
planning stage through the analysis and reporting of the data. The chapter summarizes the 
major components of the program with references to the appropriate chapters for more details. 
The organization of this chapter, and of the report, is as follows: 



² Because both mathematics and reading were assessed at the fourth-grade level, the fourth-grade teacher questionnaire
asked questions about mathematics and reading programs. The mathematics teachers of the students who participated in
the mathematics assessment completed the mathematics questions and the reading teachers of the students in the reading
assessment completed the reading questions. All teachers were asked to complete the questions about their educational
background and training. For the mathematics assessment, only the data from the students’ mathematics teachers are
included.

³ Scales were created for the content areas of Numbers and Operations; Measurement; Geometry; Data Analysis, Statistics,
and Probability; and Algebra and Functions. An additional scale was created from items designed to measure estimation
abilities.
• Section 1.2 provides an overview of the design of the Trial State Assessment
Program.

• Section 1.3 summarizes the development of the mathematics objectives and the
development and review of the items written to measure those objectives. Details
are provided in Chapter 2.

• Section 1.4 discusses the assignment of the cognitive and background questions to
assessment booklets and describes the focused-BIB spiral design. A complete
description is provided in Chapter 2.

• Section 1.5 outlines the sampling design used for the Trial State Assessment
Program. A fuller description is provided in Chapter 3.

• Section 1.6 summarizes the field administration procedures including securing school 
cooperation, training administrators, administering the assessment, and conducting 
quality control. Further details appear in Chapter 4. 

• Section 1.7 describes the flow of the data from their receipt at National Computer 
Systems through data entry, professional scoring, and entry into the database for 
analysis. Chapters 5 and 6 provide a detailed description of the process. 

• Section 1.8 provides an overview of the data obtained from the Trial State
Assessment.

• Section 1.9 summarizes the procedures used to weight the data from the assessment
and to obtain estimates of the sampling variability of subpopulation estimates.
Chapter 7 provides a full description of the weighting and variance estimation 
procedures. 

• Section 1.10 describes the initial analyses performed to verify the quality of the data 
in preparation for more refined analyses, with details given in Chapter 9. 

• Section 1.11 describes the item response theory scales and the overall
mathematics composite that were created for the primary analysis of the Trial State
Assessment data. Further discussion of the theory and philosophy of the scaling
technology appears in Chapter 8 with details of the scaling process in Chapter 9.

• Section 1.12 provides an overview of the linking of the scaled results from the Trial
State Assessment to those from the national mathematics assessment. Details of the
linking process appear in Chapter 9.

• Section 1.13 describes the reporting of the assessment results, with further details 
supplied in Chapter 10. 

• Appendices provide information about the participants in the objectives and item
development process, a summary of the participation rates, a list of the conditioning
variables, the IRT parameters for the mathematics items, the reporting subgroups,
composite and derived common background and reporting variables, a description of
the processes for anchoring the mathematics composite and for defining achievement
levels on that scale, and a discussion of further analysis of 1990 data.



1.2 DESIGN OF THE TRIAL STATE ASSESSMENT IN MATHEMATICS

The major aspects of the design for the Trial State Assessment in mathematics included
the following: 

• Participation was voluntary. 

• Only fourth- and eighth-grade students in public schools were assessed. Students in 
private or parochial schools were not included in the program. A representative 
sample of schools was selected in each participating jurisdiction, and students were 
randomly sampled within schools. 

• Mathematics was assessed at the fourth- and eighth-grade levels. 

• The mathematics items used in the Trial State Assessment were also used in the age
9/grade 4 and age 13/grade 8 national assessment and contained multiple-choice,
short constructed-response, and extended constructed-response items. Some items
required the use of calculators (four-function calculators at grade 4 and scientific
calculators at grade 8), geometric shapes, and protractors/rulers. The total pool of
mathematics items was divided into thirteen 15-minute blocks at each grade level. Each
student in the Trial State Assessment also was assessed with a special block
measuring estimation skills that was administered using an audiotape to pace
students through the items.

• Background questionnaires given to the students, the students' mathematics teachers,
and the principals or other school administrators provided for rich contextual
information. The background questionnaires for the Trial State Assessment were
identical to those used in the age 9/grade 4 and age 13/grade 8 national assessment.

• A complex form of matrix sampling called a balanced incomplete block (BIB)
spiraling design was used. With BIB spiraling, students in an assessment session
received different booklets, resulting in a more efficient sample. This design also
reduced student burden and provided for greater mathematics content coverage than
would have been possible had every student been administered the identical set of
items.

• The assessment time for each student was approximately 71 minutes. Each assessed
student was assigned a mathematics booklet that contained two five-minute
background questionnaires, one 3-minute background questionnaire, and three of the
thirteen 15-minute blocks containing mathematics items. Twenty-six different 
booklets were assembled. After the completion of this part of the assessment, each 
student was given another booklet containing the estimation questions. The 
estimation section took approximately 15 minutes. 




• The assessments took place in the five-week period between February 3 and March
6, 1992. One-fourth of the schools in each state were assessed each week throughout
the first four weeks; the fifth week was reserved for the scheduling of makeup 
sessions. 

• Data collection, by law, was the responsibility of each participating jurisdiction. 

• Security and uniform assessment administration were high priorities. Extensive 
training was conducted to assure that the assessment would be administered under 
standard, uniform procedures. Fifty percent of the assessment sessions were 
monitored by the contractor’s staff. 
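
The booklet counts in the list above (13 blocks, three blocks per booklet, 26 booklets) are consistent with a balanced incomplete block design in which every block appears in six booklets and every pair of blocks shares exactly one booklet. The following Python sketch illustrates one such arrangement and a simple rotating ("spiraled") hand-out order. It is an illustration only: the base booklets (0, 1, 4) and (0, 2, 7), the block numbering 0 to 12, and the function names are assumptions chosen for the example, not the booklet layout actually used in the assessment, which is described in Chapter 2.

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 13
# Assumed base booklets (a cyclic difference family mod 13); hypothetical,
# not the actual NAEP booklet map.
BASE_BOOKLETS = [(0, 1, 4), (0, 2, 7)]


def build_booklets(n_blocks=N_BLOCKS, base=BASE_BOOKLETS):
    """Create booklets by cyclically shifting each base booklet n_blocks times."""
    return [
        tuple((block + shift) % n_blocks for block in booklet)
        for booklet in base
        for shift in range(n_blocks)
    ]


def spiral(booklets, n_students):
    """Hand out booklets in rotating order so that students seated
    consecutively in a session receive different booklets."""
    return [booklets[i % len(booklets)] for i in range(n_students)]


booklets = build_booklets()

# How often each block is used, and how often each pair of blocks
# appears together in the same booklet.
block_counts = Counter(block for booklet in booklets for block in booklet)
pair_counts = Counter(
    frozenset(pair) for booklet in booklets for pair in combinations(booklet, 2)
)

print(len(booklets))                                 # 26
print(set(block_counts.values()))                    # {6}
print(len(pair_counts), set(pair_counts.values()))   # 78 {1}
```

Because all 78 possible pairs of blocks occur together once, relationships between any two blocks can be estimated even though no student takes more than three of the thirteen blocks.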



1.3 DEVELOPMENT OF MATHEMATICS OBJECTIVES, ITEMS, AND BACKGROUND QUESTIONS

Similar to all previous NAEP assessments, the objectives for the Trial State Assessment 
in mathematics were developed through a broad-based consensus process managed by the 
Council of Chief State School Officers. Educators, scholars, and citizens, representative of many 
diverse constituencies and points of view, designed objectives for the mathematics assessment, 
proposing goals they believed students should achieve in the course of their education. After 
careful reviews of the objectives, assessment questions were developed that were appropriate to 
those objectives. Representatives from State Education Agencies provided extensive input 
throughout the entire development process. 

The framework used for the 1992 mathematics assessment was the same as the one 
adopted for the 1990 assessment and was organized according to three mathematical abilities 
and five content areas. The mathematical abilities assessed were conceptual understanding, 
procedural knowledge, and problem solving. Additionally, students’ abilities in estimation were 
assessed. Content was drawn primarily from elementary and secondary school mathematics up 
to, but not including, calculus. The content areas assessed were Numbers and Operations;
Measurement; Geometry; Data Analysis, Statistics, and Probability; and Algebra and Functions. 

The Trial State Assessment included multiple-choice, short constructed-response, and 
extended constructed-response items. All items underwent extensive reviews by specialists in 
mathematics, measurement, and bias/sensitivity, as well as reviews by representatives from State 
Education Agencies. In addition, the items were reviewed by representatives of the National 
Assessment Governing Board (NAGB) in accordance with NAGB’s statutory responsibility for 
ensuring that all items selected for use in NAEP are free from racial, cultural, gender, or 
regional biases. The items were field tested on a representative group of students. Based on 
the results of the field test, items were revised or modified as necessary and then again reviewed 
for sensitivity, content, and editorial concerns. With the assistance of ETS/NAEP staff and 
outside reviewers, the mathematics Item Development Committee selected the items to include 
in the assessment. 

The 1992 mathematics assessment at grade 8 was designed to estimate trends in 
performance for states that participated in both the 1990 and 1992 Trial State Assessments. To 
permit linking to the 1990 assessment, some of the items used in 1990 were used again in 1992.
Of the 175 fourth-grade items used in 1992, 57 (16 short constructed-response items and 41
multiple-choice items) had also been used in the 1990 program. Of the 205 eighth-grade items 
used in 1992, 76 (23 short constructed-response items and 53 multiple-choice items) had also 
been used in 1990. The rest of the items used in the 1992 program were newly created. 
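
The item counts above can be checked with a few lines of arithmetic. The short Python sketch below uses only the figures quoted in this section; the dictionary layout and the function name are illustrative choices.

```python
# Item counts quoted in the text for the 1992 assessment.
ITEM_COUNTS = {
    "grade 4": {"total": 175, "short_cr_from_1990": 16, "mc_from_1990": 41},
    "grade 8": {"total": 205, "short_cr_from_1990": 23, "mc_from_1990": 53},
}


def summarize(counts):
    """Return (items carried over from 1990, newly created items, carried-over share)."""
    carried = counts["short_cr_from_1990"] + counts["mc_from_1990"]
    new = counts["total"] - carried
    return carried, new, carried / counts["total"]


for grade, counts in ITEM_COUNTS.items():
    carried, new, share = summarize(counts)
    print(f"{grade}: {carried} carried over, {new} new, {share:.1%} of pool from 1990")
# grade 4: 57 carried over, 118 new, 32.6% of pool from 1990
# grade 8: 76 carried over, 129 new, 37.1% of pool from 1990
```

Roughly one-third of the item pool at each grade was therefore common to both years.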

Chapter 2 includes specific details about developing the objectives and items for the 
Trial State Assessment. The details of the professional scoring process are given in Chapter 5. 



1.4 ASSESSMENT INSTRUMENTS 

The assembly of cognitive items into booklets and their subsequent assignment to 
assessed students was determined by a balanced incomplete block (BIB) design with spiraled 
administration. Details of the BIB design are provided in Chapter 2. 

The student assessment booklets contained six sections and included both cognitive and
noncognitive items. In addition to three sections of cognitive questions, each booklet included
two 5-minute sets of general and mathematics background questions designed to gather 
contextual information about students, their experiences in mathematics, and their perceptions 
of the subject, and a section of questions designed to gather information about the students’ 
levels of motivation in taking the assessment and their familiarity with the types of assessment 
questions they encountered. 

In addition to the student assessment booklets, three other instruments provided data 
relating to the assessment — mathematics teacher questionnaires, school characteristics and 
policies questionnaires, and an excluded student questionnaire.

The teacher questionnaires were administered to the fourth- and eighth-grade
mathematics teachers of the students participating in the assessment. At grade 4, the
questionnaire consisted of three sections and took approximately 20 minutes to complete. The
first section focused on teachers’ background and experience. The second section focused on
classroom information related to mathematics. The third section, which was completed only by
those reading teachers of students in the fourth-grade Trial State Assessment in reading, focused
on classroom information about reading. At grade 8, the mathematics teacher questionnaire
consisted of two sections and also took approximately 20 minutes to complete. The first section
focused on teachers’ background and experience. The second section focused on mathematics
classroom information.

The school characteristics and policies questionnaire was given to the principal or other
administrator in each participating school and took about 15 minutes to complete. The
questions asked about the principal’s background and experience, school policies, programs, 
facilities, and the composition and background of the students and teachers. 

The excluded student questionnaire was completed by the teachers of those students who 
were selected to participate in the Trial State Assessment sample but who were determined by 
the school to be ineligible to be assessed because they either had an Individualized Education 
Plan (IEP) and were not mainstreamed at least 50 percent of the time, or were categorized as 
Limited English Proficient (LEP). This questionnaire took approximately three minutes per 
student to complete and asked about the nature of the student’s exclusion and the special
programs in which the student participated. 



1.5 THE SAMPLING DESIGN 

The target population for the Trial State Assessment Program consisted of fourth- and 
eighth-grade students enrolled in public schools. The representative sample of students assessed 
in the Trial State Assessment came from about 125 public schools for grade 4 and 100 public
schools for grade 8 in each jurisdiction, unless a jurisdiction had fewer than 125 schools with a
fourth grade or fewer than 100 schools with an eighth grade, in which case all or almost all
schools were asked to participate. The sample in each state was designed both to produce 
aggregate estimates for the state, and selected subpopulations (depending upon the size and 
distribution of the various subpopulations within the state), and also to enable comparisons to 
be made, at the state level, between administration with monitoring and without monitoring. 

The schools were stratified by urbanicity, percentage of Black and Hispanic students enrolled, 
and median household income. 

At each grade level, 30 students selected from each school provided a sample size of 
approximately 3,000 students per state. The student sample size of 30 for each school was 
chosen to ensure at least 2,000 students participating from each state at each grade level for 
each subject area, allowing for school nonresponse, exclusion of students, inaccuracies in the 
measures of enrollment, and student absenteeism from the assessment. 
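
The relationship between the selected sample and the 2,000-student floor can be made concrete with a small calculation. The Python sketch below assumes 100 sampled schools (the nominal grade 8 figure, which reproduces the "approximately 3,000" quoted above) and treats school nonresponse, exclusion, enrollment inaccuracies, and absenteeism as a single combined loss; the function names and that simplification are assumptions for illustration, not the sampling contractor's procedure, which is described in Chapter 3.

```python
STUDENTS_PER_SCHOOL = 30   # students selected per school (from the text)
MIN_ASSESSED = 2000        # target minimum of participating students (from the text)


def selected_sample(n_schools, per_school=STUDENTS_PER_SCHOOL):
    """Number of students selected before any losses."""
    return n_schools * per_school


def required_yield(n_schools, minimum=MIN_ASSESSED):
    """Smallest fraction of selected students who must actually be assessed
    for the state to reach the minimum."""
    return minimum / selected_sample(n_schools)


def tolerable_loss(n_schools, minimum=MIN_ASSESSED):
    """Largest number of selected students that can be lost to all causes combined."""
    return selected_sample(n_schools) - minimum


n_schools = 100  # assumed nominal number of sampled schools
print(selected_sample(n_schools))           # 3000
print(round(required_yield(n_schools), 3))  # 0.667
print(tolerable_loss(n_schools))            # 1000
```

Under these assumptions, a state could lose up to one-third of its selected students to all causes combined and still meet the 2,000-student minimum.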

The students within a school were sampled from lists of fourth- and eighth-grade 
students. The decisions to exclude students from the assessment were made by school 
personnel, as in the national assessment, and used the same specific criteria for exclusion
(described in section 1.4) as in the national assessment. Each excluded student was carefully
accounted for to estimate the percentage of the state population deemed unassessable and the
reasons for exclusion.

Chapter 3 describes the various aspects of selecting the sample for the 1992 Trial State
Assessment— the construction of the school frames, the stratification process, the updating of 
the school frame with new schools, the actual sample selection, and the sample selection for the 
field test. 



1.6 FIELD ADMINISTRATION 

The administration for the 1992 program and the 1991 field test involved a collaborative 
effort between staff in the participating states and schools and the NAEP contractors, especially 
Westat, the field administration contractor. The purpose of the field test conducted in 1991 was
to try out the items and procedures for the 1992 program. 

Each jurisdiction volunteering to participate in the 1991 field test and in the 1992 Trial
State Assessment was asked to appoint a state coordinator who became the liaison between 
NAEP staff and the participating schools. At the local school level, an assessment administrator 
was responsible for preparing for and conducting the assessment session in one or more schools. 
These individuals were usually school or district staff and were trained by Westat staff. In 
addition, Westat hired and trained a state supervisor for each state. The state supervisors were 
responsible for working with the state coordinators and overseeing assessment activities. Westat 
also hired and trained four to eight quality control monitors in each state to monitor 50 percent 
of the assessment sessions in 1992. During the field test, the state supervisors monitored all 
sessions. 

Chapter 4 describes the procedures for obtaining cooperation from states and provides 
details about the field activities for both the field test and 1992 program. Chapter 4 also 
describes the planning and preparations for the actual administration of the assessment, the 
training and monitoring of the assessment sessions, and a description of the responsibilities and 
roles of the state coordinators, state supervisors, assessment administrators, and quality control 
monitors. 



1.7 MATERIALS PROCESSING AND DATABASE CREATION 

Upon completion of each assessment session, school personnel shipped the assessment 
booklets and forms from the field to NAEP subcontractor National Computer Systems for 
professional scoring, entry into computer files, and checking. Then the files were sent to 
Educational Testing Service for creation of the database. Careful checking assured that all data 
from the field were received. More than 498,000 booklets or questionnaires were received and 
processed for the mathematics assessment. The processing of these data is detailed in Chapter 
5. That chapter also details the printing, distribution, receipt, processing, and final disposition of 
the 1992 Trial State Assessment materials. 

The volume of collected data and the complexity of the Trial State Assessment 
processing design, with its spiraled distribution of booklets, as well as the concurrent 
administration of this assessment and the national assessments, required the development and 
implementation of flexible, innovatively designed processing programs and a sophisticated 
Process Control System. This system, which is described in Chapter 5, allowed an integration of 
data entry and workflow management systems, including carefully planned and delineated 
editing, quality control, and auditing procedures. 

The data transcription and editing procedures are also described in Chapter 5. These 
procedures resulted in the generation of disk and tape files containing various assessment 
information, including the sampling weights required to make valid statistical inferences about 
the population from which the Trial State Assessment sample was drawn. Before any analysis 
could begin, the data from these files had to undergo a quality control process at ETS. The files 
were then merged into a comprehensive, integrated database. Chapter 6 describes the 
transcribed data files, the procedure of merging them, or bringing them together, to create the 
Trial State Assessment database, and the results of the quality control process. 



1.8 THE TRIAL STATE ASSESSMENT DATA 



Approximately 2,500 students at each grade were assessed within each state and the 
District of Columbia; apart from nonresponse, all fourth- and eighth-grade public-school 
students were assessed in Guam and the Virgin Islands. 

The basic information collected from the Trial State Assessment consisted of the 
responses of the assessed students to the 158 mathematics exercises at grade 4 and 183 exercises 
at grade 8. To limit the assessment time for each student to about one hour, a variant of matrix 
sampling called BIB spiraling was used to assign a subset of the full exercise pool to each 
student. At each grade level, the set of items was divided into 13 unique blocks, each requiring 
15 minutes for completion. Each assessed student received a booklet containing three of the 13 
blocks according to a design that ensured that each block was administered to a representative 
sample of students within each jurisdiction. Following the administration of this booklet, each 
student was given a special booklet that contained the audiotaped estimation items. The data 
also included responses to the background questionnaires (described in section 1.4 and 
Chapter 2). 
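
The block-to-booklet assignment described above can be sketched in code. The following is a minimal illustration only, not the design actually used: it assumes that the balanced incomplete block (BIB) design has 26 booklets in which every pair of blocks appears together exactly once, and it builds such a design from two hypothetical base triples shifted cyclically modulo 13. The names build_booklets and spiral are illustrative.

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 13  # number of blocks per grade, as stated in the text

# Assumed base triples (not taken from the report). Shifting each one
# cyclically modulo 13 yields 26 booklets of three blocks each.
BASE_TRIPLES = [(0, 1, 4), (0, 2, 7)]


def build_booklets(n_blocks=N_BLOCKS, base_triples=BASE_TRIPLES):
    """Return the booklets as sorted tuples of block indices."""
    booklets = []
    for base in base_triples:
        for shift in range(n_blocks):
            booklets.append(tuple(sorted((b + shift) % n_blocks for b in base)))
    return booklets


def spiral(booklets, n_students):
    """Hand out booklets in rotating order so each is used about equally often."""
    return [booklets[i % len(booklets)] for i in range(n_students)]


booklets = build_booklets()
# How often each block, and each pair of blocks, appears across booklets.
block_counts = Counter(block for booklet in booklets for block in booklet)
pair_counts = Counter(pair for booklet in booklets for pair in combinations(booklet, 2))
```

Under these assumptions each block appears in six booklets and each of the 78 possible block pairs appears in exactly one, which is the balance property that lets every block, and every pair of blocks, be observed on a comparable sample of students.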

The national data to which the Trial State Assessment results were compared came from 
nationally representative samples of public-school students in the fourth and eighth grade. 

These samples were a part of the full 1992 national mathematics assessment in which nationally 
representative samples of students in public and private schools from three age cohorts were 
assessed: students who were either in the fourth grade or 9 years old; students who were either 
in the eighth grade or 13 years old; and students who were either in the twelfth grade or 17 
years old. 

The assessment instruments used in the Trial State Assessment were also used in the 
fourth- and eighth-grade national assessments and were administered using the identical 
procedures in both assessments. The time of testing for the state assessments (February 3 to 
March 6, 1992) occurred within the time of testing of the national assessment (January 6 to 
April 3, 1992). However, the state assessments differed from the national assessment in one 
important regard: Westat staff collected the data for the national assessment while, in 
accordance with the NAEP legislation, data collection activities for the Trial State Assessment 
were the responsibility of each participating jurisdiction. These activities included ensuring the 
participation of selected schools and students, assessing students according to standardized 
procedures, and observing procedures for test security. To provide quality control of the Trial 
State Assessment, a random half of the administrations within each state was monitored. 



1.9 WEIGHTING AND VARIANCE ESTIMATION 

The Trial State Assessment used a complex sample design to select the students to be 
assessed in each of the participating jurisdictions. The properties of a sample from a complex 
design are very different from those of a simple random sample in which every student in the 
target population has an equal chance of selection and in which the observations from different 
sampled students can be considered to be statistically independent of one another. The 
properties of the sample from the complex Trial State Assessment design were taken into 
account in the analysis of the assessment data. 

One way that the properties of the sample design were taken into account was through 
the use of sampling weights which account for the fact that the probabilities of selection are not 
identical for all students. These weights also include adjustments for nonresponse of students 
and of schools. All population and subpopulation characteristics based on the Trial State 
Assessment data used the sampling weights in their estimation. Chapter 7 provides details on 
the computation of these weights. 
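
As a rough illustration of how such weights enter an estimate, the sketch below forms a weight as the inverse of an overall selection probability multiplied by nonresponse adjustment factors, and then computes a weighted mean. The two-stage factorization, the function names, and the parameter names are assumptions made for illustration; the actual computation is the one documented in Chapter 7.

```python
def sampling_weight(p_school, p_student_in_school, school_adj=1.0, student_adj=1.0):
    """Inverse selection probability times nonresponse adjustments (assumed form)."""
    base_weight = 1.0 / (p_school * p_student_in_school)
    return base_weight * school_adj * student_adj


def weighted_mean(values, weights):
    """Estimate a population mean from sampled values and their weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

For example, a student selected with probability 0.5 from a school selected with probability 0.1 would carry a base weight of about 20, meaning that the student stands in for roughly 20 students in the population.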

In addition to deriving appropriate estimates of population characteristics, it is essential 
to obtain appropriate measures of the degree of uncertainty of those statistics. One component 
of uncertainty is a result of sampling variability, which measures the dependence of the results 
on the particular sample of students actually assessed. Because of the effects of cluster selection 
(first schools are selected and then students are selected within those schools), observations 
made on different students cannot be assumed to be independent of each other (and, in fact, are 
generally positively correlated). As a result, classical variance estimation formulae will produce 
incorrect results. Instead, a variance estimation procedure which does take the characteristics of 
the sample into account was used for all analyses. This procedure, called the jackknife variance 
estimator, is discussed in Chapter 7. 
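
The general idea of a jackknife variance estimate can be sketched as follows. This is a generic delete-one-school jackknife for a weighted mean, given as an assumption-laden illustration: the specific replicate construction used for the Trial State Assessment is the one described in Chapter 7 and is not reproduced here. The record layout (school identifier, value, weight) is hypothetical.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_variance(records):
    """Delete-one-school jackknife for a weighted mean.

    records: list of (school_id, value, weight) tuples.
    Returns (full-sample estimate, jackknife variance estimate).
    """
    schools = sorted({school for school, _, _ in records})
    n_groups = len(schools)
    full = weighted_mean([v for _, v, _ in records], [w for _, _, w in records])

    squared_devs = 0.0
    for dropped in schools:
        # Recompute the statistic with one whole school removed, so that the
        # within-school correlation is reflected in the replicate estimates.
        kept = [(v, w) for school, v, w in records if school != dropped]
        replicate = weighted_mean([v for v, _ in kept], [w for _, w in kept])
        squared_devs += (replicate - full) ** 2

    return full, (n_groups - 1) / n_groups * squared_devs
```

Because whole schools are dropped rather than individual students, the spread of the replicate estimates reflects the clustering that classical formulae ignore.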

The jackknife variance estimator provides a reasonable measure of uncertainty for any 
statistic based on values observed without error. Statistics such as the average proportion of 
students correctly answering a given question meet this requirement but other statistics, based 
on estimates of student mathematics proficiency, such as the average mathematics proficiency of 
a subpopulation, do not. Because each student typically responds to relatively few items within a 
particular mathematics content area, there exists a nontrivial amount of imprecision in the 
measurement of the proficiency of any given student. This imprecision adds an additional 
component of variability to statistics based on estimates of individual proficiencies. The 
estimation of this component of variability is discussed in Chapter 8. 



1.10 PRELIMINARY DATA ANALYSIS 

Immediately after receipt from NCS of the machine-readable data tapes containing 
students’ responses, all cognitive and noncognitive items were subjected to an extensive item 
analysis to assure that each item represented what it was purported to measure. 

Each block of cognitive items was subjected to item analysis routines, which yielded, for 
each item, the number of respondents, the percentage of students who selected the correct 
response and each incorrect response, the percentage who omitted the item, the percentage who 
did not reach the item, and the correlation between the item score and the block score. In 
addition, the item-analysis program provided summary statistics for each block, including 
reliability (internal consistency). These kinds of analyses were used to check on the scoring of 
the items, to verify the appropriateness of the difficulty level of the items, and to check for 
speededness. The results also were reviewed by knowledgeable project staff in search of 
anomalies that might signal unusual results or errors in creating the database. 
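
The statistics listed above can be illustrated with a small sketch. It assumes a hypothetical response coding (an option label, "omit", or "not_reached"), scores omitted and not-reached responses as incorrect, uses a Pearson correlation between the 0/1 item score and the block score, and uses coefficient alpha as the internal-consistency reliability. These are illustrative choices, not a description of the item-analysis program actually used.

```python
from statistics import mean, pvariance

OMIT, NOT_REACHED = "omit", "not_reached"  # hypothetical response codes


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def score_block(responses, keys):
    """Score each response 1 if it matches the key, else 0 (assumed rule)."""
    return [[1 if r == k else 0 for r, k in zip(row, keys)] for row in responses]


def item_analysis(responses, keys):
    """Per-item summary statistics for one block of multiple-choice items."""
    n = len(responses)
    scored = score_block(responses, keys)
    block_scores = [sum(row) for row in scored]
    report = []
    for j in range(len(keys)):
        raw = [row[j] for row in responses]
        item_scores = [row[j] for row in scored]
        report.append({
            "n": n,
            "pct_correct": 100 * sum(item_scores) / n,
            "pct_omit": 100 * raw.count(OMIT) / n,
            "pct_not_reached": 100 * raw.count(NOT_REACHED) / n,
            "r_item_block": pearson(item_scores, block_scores),
        })
    return report


def cronbach_alpha(scored):
    """Coefficient alpha for a matrix of scored (0/1) responses."""
    k = len(scored[0])
    item_var = sum(pvariance([row[j] for row in scored]) for j in range(k))
    total_var = pvariance([sum(row) for row in scored])
    return k / (k - 1) * (1 - item_var / total_var)
```

A high percentage of not-reached responses on the last items of a block is the kind of pattern that would point to speededness.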

Tables of the weighted percentages of students choosing each of the possible responses 
to each cognitive and background item were created and distributed to each state and 
jurisdiction. Additional analyses comparing the data from the monitored sessions with that from 
the unmonitored sessions were conducted to determine the comparability of the assessment data 
from the two types of administrations. Finally, differential item functioning analyses were 
carried out to identify items that were differentially difficult for various subgroups and to 
reexamine such items with respect to their fairness and their appropriateness for inclusion in the 
scaling process. Further details of the preliminary analyses conducted on the data appear in 
Chapter 9. 
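
The text does not name the differential item functioning statistic at this point; details are in Chapter 9. As one common possibility, assumed here purely for illustration, the sketch below computes a Mantel-Haenszel common odds ratio across strata of students matched on total score and converts it to the delta-scale index -2.35 ln(alpha). The tuple layout is hypothetical.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score strata.

    strata: list of (ref_right, ref_wrong, focal_right, focal_wrong) counts,
    one tuple per level of the matching variable.
    """
    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # empty stratum contributes nothing
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator


def mh_d_dif(alpha):
    """Delta-scale index; negative values indicate an item harder for the focal group."""
    return -2.35 * math.log(alpha)
```

An odds ratio near 1 (an index near 0) indicates that, among students of comparable overall performance, the two groups found the item about equally difficult.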



1.11 SCALING THE ASSESSMENT ITEMS 

The primary analysis and reporting of the results from the Trial State Assessment used 
item response theory (IRT) scale score models. Scaling models quantify a respondent's 
tendency to provide correct answers to the items contributing to a scale as a function of a 
parameter called proficiency that can be viewed as a summary measure of performance across 
all items entering into the scale. Three distinct IRT models were used for scaling: 1) 3- 
parameter logistic models for multiple choice items; 2) 2-parameter logistic models for simple 
constructed-response items that were scored correct or incorrect; and 3) generalized partial 
credit models for extended constructed response items that were scored on a multi-point scale. 
Chapter 8 provides an overview of the scaling models used, with further details on the 
application of these models provided in Chapter 9. 
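
For orientation, the three response functions can be written as short functions. This is a sketch under stated assumptions: the scaling constant D = 1.7 and the particular step-parameter form of the generalized partial credit model are conventional choices assumed here, and the exact parameterization used for the assessment is the one given in Chapter 8.

```python
import math

D = 1.7  # assumed scaling constant, conventional in logistic IRT models


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-parameter logistic model.

    a: discrimination, b: difficulty, c: lower asymptote ("guessing").
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_2pl(theta, a, b):
    """2-parameter logistic model: the 3PL with lower asymptote fixed at zero."""
    return p_3pl(theta, a, b, 0.0)


def p_gpcm(theta, a, b, steps):
    """Generalized partial credit probabilities for scores 0..m.

    steps: category step parameters d_1..d_m (assumed parameterization).
    """
    cumulative = [0.0]
    for d in steps:
        cumulative.append(cumulative[-1] + D * a * (theta - b + d))
    largest = max(cumulative)  # subtract the maximum for numerical stability
    terms = [math.exp(z - largest) for z in cumulative]
    total = sum(terms)
    return [t / total for t in terms]
```

With a single step parameter equal to zero, the generalized partial credit model reduces to the 2-parameter logistic model, which is a convenient check on the implementation.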

A series of scales was created for the Trial State Assessment to summarize students' 
mathematics performance. These scales were defined identically to those used for the scaling of 
the national NAEP fourth- and eighth-grade mathematics data. Five content area scales, based 
on the paradigm described in Chapter 2, were created to correspond to each of the following 
areas: Numbers and Operations; Measurement; Geometry; Data Analysis, Statistics, and 
Probability; and Algebra and Functions. An additional scale was created for other items 
designed to measure estimation abilities. Although the items comprising each scale were 
identical to those used for the national program, the item parameters for the Trial State 
Assessment scales were estimated from the combined data from all states and jurisdictions 
participating in the Trial State Assessment. Item parameter estimation was based on an item 
calibration sample consisting of an approximately 25 percent sample of all the available data. 

To ensure equal representation in the scaling process, each state and jurisdiction was equally 
represented in the item calibration sample, as were the monitored and unmonitored 
administrations from each state and jurisdiction. Chapter 9 provides further details about the 
item parameter estimation. 

The fit of the IRT model to the observed data was examined within each scale by 
comparing the estimates of the empirical item characteristic functions with the theoretic curves. 
For binary-scored items, nonmodel-based estimates of the expected proportions of correct 
responses to each item for students with various levels of scale proficiency were compared with 
the fitted item response curve; for the extended constructed response items, the comparisons 
were based on the expected proportions of students with various levels of scale proficiency who 
achieved each score level. In general, the item level results were well fit by the scaling models. 

Using the item parameter estimates, estimates of various population statistics were 
obtained for each jurisdiction in the Trial State Assessment. The NAEP methods use random 
draws ("plausible values") from estimated proficiency distributions for each student to compute 
population statistics. Plausible values are not optimal estimates of individual student 
proficiencies; instead, they serve as intermediate values to be used in estimating population 
characteristics. Under the assumptions of the scaling models, these population estimates will be 
consistent, in the sense that the estimates approach the model based population values as the 
sample size increases, which would not be the case for subpopulation estimates obtained by 
aggregating optimal estimates of individual proficiency. Chapter 8 provides further details on 
the computation and use of plausible values. 
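
One way to see how plausible values feed a population estimate and its uncertainty is the following sketch. It assumes the standard multiple-imputation combining rule: the statistic is computed once per set of plausible values, the results are averaged, and the variance among them, inflated by (1 + 1/M), is added to the sampling variance. Whether Chapter 8 uses exactly this form is an assumption here, and the function name is illustrative.

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results into one estimate and total variance.

    estimates: the statistic computed from each of the M plausible-value sets.
    sampling_variances: the sampling (e.g. jackknife) variance of each estimate.
    """
    m = len(estimates)
    point = mean(estimates)
    within = mean(sampling_variances)   # sampling variability
    between = variance(estimates)       # measurement imprecision (divisor M - 1)
    total = within + (1.0 + 1.0 / m) * between
    return point, total
```

The between-set term is the additional component of variability attributed in section 1.9 to imprecision in measuring individual proficiency.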

In addition to the plausible values for each scale, a composite of the content area scales 
was created at each grade as a measure of overall mathematics proficiency. This composite was 
a weighted average of the content area scale plausible values in which the weights were 
proportional to the relative importance assigned to each content area as specified in the 
mathematics objectives. Consistent with the mathematics framework, the weights used to define 
the composite were somewhat different at grades 4 and 8. The definitions of the composites for 
the Trial State Assessment program at grades 4 and 8 were identical to those used for the 
national fourth- and eighth-grade mathematics assessments. 
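
The composite can be illustrated as a weighted average computed separately for each plausible value. The weights in the usage note below are placeholders chosen only to show the arithmetic; they are not the weights specified in the mathematics objectives.

```python
def composite(scale_values, weights):
    """Weighted average of content-area scale values for one plausible value.

    scale_values and weights are dicts keyed by content-area name; weights are
    normalized by their sum, so they need not add to exactly 1.
    """
    total_weight = sum(weights.values())
    return sum(scale_values[area] * weights[area] for area in weights) / total_weight
```

For instance, with hypothetical weights of 0.25 and 0.75 on two scales whose values are 200 and 300, the composite is 275.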



1.12 LINKING THE TRIAL STATE RESULTS TO THE NATIONAL RESULTS 

The results from the Trial State Assessment were linked to those from the national 
NAEP through linking functions determined by comparing the results for the aggregate of all 
students assessed in the Trial State Assessment at each of grades 4 and 8 with the results for 
students of the matching grade within the State Aggregate Comparison (SAC) subsample of the 
national NAEP. The SAC subsample of the national NAEP for a given grade is a representative 
sample of the population of all grade-eligible public-school students within the aggregate of the 
41 participating states and the District of Columbia. Specifically, the grade 4 SAC subsample 
consists of all fourth-grade students in public schools in the states and the District of Columbia 
who were assessed in the national cross-sectional mathematics assessment. The grade 8 SAC 
subsample is equivalently defined for eighth-grade students who participated in the national 
assessment. 

For each grade, a linear equating within each scale was used to link the results of the 
Trial State Assessment to the national NAEP. The adequacy of linear equating was evaluated 
by comparing, for each scale, the distribution of mathematics proficiency based on the 
aggregation of all assessed students at each grade from the participating states and the District 
of Columbia with the equivalent distribution based on the students in the SAC subsample for 
the matching grade. In the estimation of these distributions, the students were weighted to 
represent the target population of public-school students in the specified grade in the 
aggregation of the states and the District of Columbia (the students from Guam and the Virgin 
Islands were not included in the equating). If a linear equating is adequate, the distribution for 
the aggregate of states and the District of Columbia and that for the SAC subsample will have, 
to a close approximation, the same shape, in terms of the skewness, kurtosis, and higher 
moments of the distributions. The only differences in the distributions allowed by linear 
equating are in the means and variances. This was found to be the case. 

The linking was accomplished for each grade and scale by matching the mean and 
standard deviation of the scale proficiencies across all students in each of grades 4 and 8 in the 
Trial State Assessment (excluding Guam and the Virgin Islands) to the corresponding scale 
mean and standard deviation across all students in the matching grade SAC subsample. Further 
details of the linking are given in Chapter 9. 
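
The mean and standard deviation matching described above amounts to a linear transformation, which can be sketched as follows. The function and variable names are illustrative, and the use of weighted moments with the weights as the divisor is an assumption; Chapter 9 gives the procedure actually used.

```python
def weighted_mean_sd(values, weights):
    """Weighted mean and standard deviation of a set of scale values."""
    total = sum(weights)
    m = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - m) ** 2 for v, w in zip(values, weights)) / total
    return m, var ** 0.5


def linking_constants(tsa_values, tsa_weights, sac_values, sac_weights):
    """Slope and intercept that map the Trial State Assessment scale to the SAC scale."""
    tsa_mean, tsa_sd = weighted_mean_sd(tsa_values, tsa_weights)
    sac_mean, sac_sd = weighted_mean_sd(sac_values, sac_weights)
    slope = sac_sd / tsa_sd
    intercept = sac_mean - slope * tsa_mean
    return slope, intercept


def apply_link(value, slope, intercept):
    """Transform one Trial State Assessment scale value onto the national scale."""
    return slope * value + intercept
```

After the transformation, the aggregate Trial State Assessment distribution has the same mean and standard deviation as the SAC subsample, while its shape is left unchanged, which is why the adequacy check compares skewness, kurtosis, and higher moments.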



1.13 REPORTING THE TRIAL STATE ASSESSMENT RESULTS 

Each state and jurisdiction that participated in the Trial State Assessment received 
multiple copies of a summary report providing the state’s results with accompanying text and 
tables, and including national and regional comparisons. These reports were generated by a 
computerized report-generation system in which graphic designers, statisticians, data analysts, 
and report writers collaborated to develop shells of the reports in advance of the analysis. 

These prototype reports were provided to State Education Agency personnel for their reviews 
and comments. The results of the data analysis were then automatically incorporated into the 
reports that gave, in addition to tables and graphs of the results, interpretations of those results 
including indications of subpopulation comparisons of statistical and substantive significance. 

Each report contained state-level estimates of mean proficiencies, both for the state as a 
whole and for categories of the key reporting variables: gender, race/ethnicity, level of parental 
education, and community type. Results were presented for each scale and for the overall 
mathematics composite. Results were also reported for a variety of other subpopulations based 
on variables derived from the student, teacher, and school questionnaires. Standard errors were 
included for all statistics. 

A second report, the NAEP 1992 Mathematics Report Card for the Nation and the States, 
highlights key assessment results for the nation and summarizes results across the states and 
territories participating in the assessment. This report contains composite scale results 
(proficiency means, proportions at or above achievement levels, etc.) for the nation, each of the 
four regions of the country, and each jurisdiction participating in the Trial State Assessment, 
both overall and by the primary reporting variables. In addition, overall results are reported for 
each of the content area scales. For the jurisdictions that participated in both the 1990 and 
1992 Trial State Assessments, reported results include trend comparisons to 1990. 

The third type of summary report is entitled Data Compendium from the NAEP 1992 
Mathematics Assessment for the Nation and the States. Like the Report Card, the Compendium 
reports results for the nation and for all of the states and territories participating in the Trial 
State Assessment. The Compendium contains most of the tables included in the Report Card plus 
additional tables that provide composite scale results for a large number of secondary reporting 
variables. 

The fourth type of summary report is a five-section almanac. Three of the sections of 
the almanac (referred to as proficiency sections) present analyses based on responses to each of 
the questionnaires (student, mathematics teacher, and school) administered as part of the Trial 
State Assessment. The fourth section of the almanac, the scale section, reports proficiency 
means and associated standard errors for the five mathematics content-area scales and the 
estimation scale. Results in this section are also reported for the total group in each state, as 
well as for select subgroups of interest. The final section of the almanac, the "p-value" section, 
provides the total-group proportion of correct responses to each cognitive item included in the 
assessment. 

The production of the state reports, Mathematics Report Card, Data Compendium, and the 
almanacs required a large number of decisions about a variety of data analysis and statistical 
issues. For example, because the demographic characteristics of the fourth- and eighth-grade 
public-school students vary widely by state, the proportions of students in the various categories 
of the race/ethnicity, parental education, and type of community variables varied by state. 
Chapter 10 documents the major conventions and statistical procedures used in generating the 
state reports, Mathematics Report Card, Data Compendium, and the almanacs. The chapter 
describes the rules, based on effect size and sample size considerations, that were used to 
establish whether a particular category contained sufficient data for reliable reporting of results 
for a particular state. Chapter 10 also describes the multiple comparison and effect size-based 
inferential rules that were used for evaluating the statistical and substantive significance of 
subpopulation comparisons. 

To provide information about the generalizability of the results, a variety of information 
about participation rates was reported for each state and jurisdiction. This included the school 
participation rates, both in terms of the initially selected samples of schools and in terms of the 
finally achieved samples, including replacement schools. The student participation rates, the 
rates of students excluded due to Limited English Proficiency (LEP) and Individualized 
Education Plan (IEP) status, and the estimated proportions of assessed students who are 
classified as IEP or LEP were also reported by state. 



Chapter 2 



DEVELOPING THE MATHEMATICS OBJECTIVES, COGNITIVE ITEMS, 
BACKGROUND QUESTIONS, AND ASSESSMENT INSTRUMENTS 



Stephen L. Koffler 
Educational Testing Service 



2.1 OVERVIEW 

Similar to all previous NAEP assessments, the framework and objectives for the Trial 
State Assessment Program in mathematics were developed through a broad-based consensus 
process. Educators, scholars, and citizens, representative of many diverse constituencies and 
points of view, designed objectives for the mathematics assessment, proposing goals they 
believed students should achieve in the course of their education. The framework and objectives 
were initially developed for the 1990 mathematics assessment. They were used in both the 1990 
Trial State Assessment and the national NAEP programs and were used again for both 
programs in 1992. In both years, the Trial State Assessment was a subset of the national 
mathematics assessment. The same objectives and instruments were used in both. After careful 
reviews of the objectives, assessment items were developed that were appropriate to those 
objectives. All items underwent extensive reviews by specialists in mathematics, measurement, 
and bias/sensitivity, as well as reviews by state representatives. 

The objectives and item development efforts were governed by four major 
considerations: 

• As specified in the 1988 NAEP legislation, the objectives had to be developed 
through a consensus process involving subject-matter experts, school administrators, 
teachers, and parents, and the items had to be carefully reviewed for potential bias. 

• As outlined in the ETS proposal for the administration of the NAEP contract, the 
development of the items had to be guided by a Mathematics Item Development 
Panel. 

• As described in the ETS Standards for Quality and Fairness (ETS, 1987), all 
materials developed at ETS had to be in compliance with specified procedures. 

• As per federal regulations, all cognitive and background items had to be submitted 
to a federal clearance process. 

This chapter includes specific details about developing the objectives and items for the 
1992 mathematics assessment. The chapter also describes the instruments — the student 
assessment booklets (and the manner in which the items were organized into blocks to create 
the booklets), mathematics teacher questionnaires, school characteristics and policies 
questionnaires, and excluded student questionnaire. Many committees worked on the 
development of the framework, objectives, and items for the Trial State Assessment. A list of 
the committees and consultants participating in the development process for the 1992 
mathematics assessment is included in Appendix A. 



2.2 CONTEXT FOR PLANNING THE 1992 MATHEMATICS ASSESSMENT 1 

Anticipating the 1988 legislation that authorized the Trial State Assessment, in mid-1987 
the federal government arranged for a special grant from the National Science Foundation and 
the U.S. Department of Education to the Council of Chief State School Officers to prepare the 
framework and objectives and make recommendations about reporting for the Trial State 
Assessment Program in mathematics. 

The Council of Chief State School Officers established the National Assessment Planning 
Project to oversee their work for the Trial State Assessment. The National Assessment 
Planning Project, whose members included policymakers, practitioners, and citizens nominated 
by 18 national organizations, had two primary purposes — to recommend objectives for the 1990 
Trial State Assessment in eighth-grade mathematics and to make suggestions for reporting the 
results from that program. However, because the 1990 objectives had to be coordinated across 
the three grades, the objectives developed by the Project governed the entire NAEP 
mathematics assessment, including the national assessment at grades 4, 8, and 12 as well as the 
Trial State Assessment at grade 8. This was also true for 1992. 



2.3 ASSESSMENT DESIGN PRINCIPLES 

The Council of Chief State School Officers created a Mathematics Objectives Committee 
to recommend objectives for the assessment. The Committee consisted of a teacher, a school 
administrator, mathematics education specialists from various states, mathematicians, parents, 
and citizens. 

Two principles emerged during the discussions of the Mathematics Objectives 
Committee and became the basis for structuring the framework and objectives for the 
assessment. The first principle was that a national assessment, designed to provide state-level 
comparisons, should not measure only those topics and skills already included in the objectives 
of all states nor be geared to the least common denominator of student preparation. The second 
principle was that the assessment should not be used to steer instruction toward one particular 
pedagogical or philosophical viewpoint to the exclusion of others that are widely held. 



1 For more details see Mathematics Objectives, 1990 Assessment (National Assessment of Educational Progress, 1988). 

The objectives development was also guided by several other considerations: that the 
assessment should 1) reflect many of the states' curricular emphases and objectives; 2) reflect 
what various scholars, practitioners, and interested citizens believe should be included in the 
curriculum; and 3) maintain some of the content of prior assessments to permit reporting of 
trends in performance. Accordingly, the committee gave attention to several frames of 
reference: 

• states' goals and concerns, as reflected through analyses of state mathematics 
curriculum guides and the recommendations of state mathematics specialists; 

• a report on "Issues in the Field," based on telephone interviews with leading 
mathematics educators, and a draft assessment framework provided by a 
subcommittee of the Mathematics Objectives Committee; 

• the draft of the Curriculum and Evaluation Standards for School Mathematics, 
developed by the National Council of Teachers of Mathematics through intensive 
work by leading mathematics educators in the United States (NCTM, 1987); and 

• the design of the 1986 mathematics assessment (NAEP, 1987). The framework for 
the 1986 NAEP mathematics assessment had 35 cells — seven content and five 
process areas. Because there were so many cells, the weightings assigned to some 
of the cells in the 1986 framework did not result in a sufficient number of items to 
provide reliable measures of students’ knowledge and skills. As a result, it was 
decided that the outline or matrix guiding the development of the 1990 mathematics 
assessment had to be simplified: rather than having a large number of cells, 
necessary complexity could be reflected through the designation of specific abilities 
and topics in each content area. 



2.4 ASSESSMENT DEVELOPMENT PROCESS 

The Mathematics Objectives Committee developed a draft framework, set of objectives, 
and set of sample items, which were distributed to the mathematics supervisor in each of the 50 
State Education Agencies. These supervisors convened a panel that reviewed the draft and 
returned comments and suggestions to the project staff. Copies of the draft were also sent to 25 
mathematics educators and scholars for review. The Mathematics Objectives Committee 
incorporated the recommendations made and formulated their final recommendations, which 
were approved by the National Assessment Planning Project Steering Committee. 

The framework and objectives were then submitted to the National Center for Education 
Statistics, which forwarded them for review to the Assessment Policy Committee, a panel that 
advised on NAEP policy at that time. The Assessment Policy Committee approved the 
objectives with minor provisions about the feasibility of full implementation. 3 The framework 



This action was contained in a statement issued by the Assessment Policy Committee’s Executive Committee on 
April 29, 1988. The recommendations were ratified by the full committee on June 18, 1988, with two stipulations: that 
the objectives be so weighted as to permit reporting on trends in performance; and, with regard to the use of calculator- 
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and objectives were refined by NAEP’s Mathematics Item Development Panel, reviewed by the 
Task Force on State Comparisons, and resubmitted to the National Center for Education 
Statistics for adoption. 



2.5 MATHEMATICS FRAMEWORK 

The framework adopted for the 1990 mathematics assessment (and therefore also for the 
1992 mathematics assessment) is organized according to three mathematical abilities and five 
content areas. The mathematical abilities assessed were conceptual understanding, procedural 
knowledge, and problem solving. Content was drawn primarily from elementary and secondary 
school mathematics up to, but not including, calculus. The content areas assessed were numbers 
and operations; measurement; geometry; data analysis, statistics, and probability; and algebra 
and functions. 

The assignment of the percentages of assessment items to be devoted to each 
mathematical ability and content area was an important component of the framework 
development because such weighting reflects the importance or value given to each area at each 
grade level. The National Assessment Planning Project wanted to create an assessment that 
would be forward-thinking and could lead instruction; thus, they decided to give more emphasis 
than in previous assessments to problem solving, geometry, and algebra and functions, and less 
to numbers and operations. 

The distribution of items by mathematical ability and mathematical content area for each 
grade as defined in the framework is provided in Table 2-1 and Table 2-2. 



Table 2-1 

NAEP Mathematics Framework: 
Percentage Distribution of Items by Grade and Ability 

Mathematical Ability            Grade 4    Grade 8 

Conceptual Understanding          40%        40% 
Procedural Knowledge              30%        30% 
Problem Solving                   30%        30% 



active items and open response questions, that the assessment be developed within the resources available for its 
administration. 
Table 2-2 

NAEP Mathematics Framework: 
Percentage Distribution of Items by Grade and Content Area 

Mathematical Content Area                       Grade 4    Grade 8 

Numbers and Operations                            45%        30% 
Measurement                                       20%        15% 
Geometry                                          15%        20% 
Data Analysis, Statistics, and Probability        10%        15% 
Algebra and Functions                             10%        20% 
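
As an illustration only, the following Python sketch converts the grade 4 content-area percentages in Table 2-2 into whole-number item targets for a pool of a given size. The largest-remainder rounding rule and the names `GRADE4_CONTENT_PCT` and `target_counts` are assumptions made for this example. The report does not describe how percentages were turned into item counts, and the counts actually achieved are those reported in Tables 2-3 and 2-4, which need not match these targets.

```python
# Grade 4 content-area percentages, copied from Table 2-2.
GRADE4_CONTENT_PCT = {
    "Numbers and Operations": 45,
    "Measurement": 20,
    "Geometry": 15,
    "Data Analysis, Statistics, and Probability": 10,
    "Algebra and Functions": 10,
}


def target_counts(percentages, pool_size):
    """Split pool_size items across areas in proportion to integer percentages.

    Uses largest-remainder rounding (an assumption of this sketch) so that
    the targets are whole numbers that sum exactly to pool_size.
    """
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {}
    remainders = []
    for area, pct in percentages.items():
        whole, rem = divmod(pool_size * pct, 100)  # exact integer arithmetic
        counts[area] = whole
        remainders.append((rem, area))
    leftover = pool_size - sum(counts.values())
    # Hand the leftover items to the areas with the largest remainders.
    remainders.sort(key=lambda pair: pair[0], reverse=True)
    for _, area in remainders[:leftover]:
        counts[area] += 1
    return counts


# 158 is the size of the grade 4 item pool given in section 2.6.
grade4_targets = target_counts(GRADE4_CONTENT_PCT, 158)
```

The same function can be applied to the grade 8 column of Table 2-2 or to the ability percentages in Table 2-1.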



2.6 COGNITIVE ITEM DEVELOPMENT 

The 1992 mathematics assessment was designed to estimate trends from 1990 in national 
performance at all three grade levels and at grade 8 for states that participated in both the 1990 
and 1992 Trial State Assessments. 

Both the 1990 and 1992 Trial State Assessments in mathematics included constructed- 
response and multiple-choice items. The 1992 assessment relied much more on constructed- 
response items than did the 1990 assessment. In addition to short constructed-response items, 
the 1992 assessment included extended constructed-response questions. The extended 
constructed-response mathematics items, which were used for the first time in the 1992 
assessment, called for the student to work through a complex problem requiring about five 
minutes to complete, and were scored on a 0-4 scale. 

All of the constructed-response items were designed to provide an extended view of 
students’ mathematical knowledge and skills. Building on recommendations from the report of 
the Council of Chief State School Officers, the NAEP Mathematics Item Development Panel 
suggested that constructed-response items be used to assess objectives in the framework that are 
best measured using such types of items (e.g., the ability to articulate mathematical ideas, draw 
figures, or generalize function relationships). About half of the constructed-response questions 
required short answers; the other half, including the extended constructed-response questions, 
required the ability to formulate and demonstrate more detailed problem-solving skills. 

To permit linking to the 1990 assessment, some of the items used in 1990 were used 
again in 1992. At grade 4, 57 items that were used in the 1990 program were carried forward to 
the 1992 program (16 short constructed-response items and 41 multiple-choice items). At grade 
8, 76 items were used again (23 short constructed-response items and 53 multiple-choice items) 
and at grade 12, 80 items were reused (24 short constructed-response and 56 multiple-choice 
items). The rest of the items used in the 1992 program were newly created. In total, the 1992 
assessment included many more items than did the 1990 assessment. 
Similar to the development of the items for the 1990 assessment, a carefully developed 
and proven series of steps was used to create multiple-choice and short constructed-response 
items for 1992 that reflected the objectives. 

1) The Mathematics Item Development Panel provided guidance to the NAEP staff 
about how the objectives could be measured given the constraints of resources and 
the feasibility of measurement technology. The Panel made recommendations 
about priorities for the assessment and types of items to be developed. 

2) The content and ability classifications of the items from the 1990 assessment that 
were released to the public were determined so that new items could be developed 
to meet those specifications. This was necessary so that the overall proportion of 
items for 1992 would continue to meet the proportions called for in the framework. 

3) Item writers, both within and outside ETS, with subject-matter expertise and skills 
and experience in creating items according to specifications, wrote assessment items. 

4) The items were reviewed and revised by NAEP/ETS staff and external reviewers. 

In addition, the items were reviewed by representatives of the National Assessment 
Governing Board in accordance with the board’s statutory responsibility for ensuring 
that all items selected for use in NAEP are free from racial, cultural, gender, or 
regional biases. 

5) Representatives from the State Education Agencies met and reviewed all items and 
background questionnaires (see section 2.9 for a discussion of the background 
questionnaires). 

6) Language editing and sensitivity reviews were conducted according to ETS quality 
control procedures. 

7) Field test materials were prepared, including the materials necessary to secure 
Office of Management and Budget clearance. 



The field tests for the multiple-choice and short constructed-response items were 
conducted in February 1991 in 22 states, the District of Columbia, and the Virgin Islands. The 
intent of the field test was to try out the items and procedures and to give the states and the 
contractors practice and experience with the proposed materials and procedures. About 500-600 
responses were obtained for each mathematics item in the field test. 

The field test data were scored and analyzed in preparation for meetings with the 
Mathematics Item Development Panel and the Background Panel. Using item analysis 
procedures, which provide a variety of statistics about each item in the field test (including p- 
values, biserial correlations, and item response theory plots), committee members, ETS test 
development staff, and NAEP/ETS staff reviewed the materials to determine 

• the most appropriate items for use in the 1992 assessment in accordance with 
content specifications (that they met the content and ability specifications in the 
framework) and statistical specifications (that their biserial correlation was not less 
than 0.20); 

• the need for revisions to items that lacked clarity or had ineffective item statistics; 
and 

• appropriate timing for assessment items. 
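
The item statistics named above can be sketched in a few lines of Python. The example below computes an item p-value (proportion correct) and a biserial correlation using the standard textbook formula, then applies the 0.20 threshold mentioned in the text. The choice of criterion score, the function names, and the sample data are assumptions for illustration; the report does not give the exact computations used in the field-test analysis.

```python
from statistics import NormalDist, mean, pstdev


def item_p_value(item_scores):
    """Proportion of students who answered the item correctly (scores are 0/1)."""
    return mean(item_scores)


def biserial(item_scores, criterion_scores):
    """Biserial correlation between a 0/1 item and a continuous criterion score.

    r_bis = (M1 - M0) / s * (p * q / y), where M1 and M0 are the criterion
    means of students who got the item right and wrong, s is the criterion
    standard deviation, p is the proportion correct, q = 1 - p, and y is the
    standard normal density at the point that splits the distribution p : q.
    """
    p = mean(item_scores)
    if p in (0, 1):
        raise ValueError("biserial is undefined when all scores are equal")
    q = 1 - p
    m1 = mean(c for s, c in zip(item_scores, criterion_scores) if s == 1)
    m0 = mean(c for s, c in zip(item_scores, criterion_scores) if s == 0)
    s = pstdev(criterion_scores)
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (m1 - m0) / s * (p * q / y)


def meets_statistical_spec(item_scores, criterion_scores, minimum=0.20):
    """Apply the 'biserial not less than 0.20' rule quoted in the text."""
    return biserial(item_scores, criterion_scores) >= minimum


# Hypothetical data: ten students, one item, and a block score as criterion.
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
block_score = [20, 18, 25, 15, 12, 22, 19, 24, 16, 14]
p_val = item_p_value(item)
r_bis = biserial(item, block_score)
```

With small or non-normal samples the biserial coefficient can exceed 1, which is one reason field tests collect several hundred responses per item.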



Once the pool of newly created items was established, the items were assembled into 
nine different "blocks" (15-minute sections established according to statistical guidelines 
developed at the beginning of the process). 5 The new blocks were assembled taking into 
account the speededness data from the field test and the fact that extended constructed-response 
items would be included in certain of the blocks. 

The development of the extended constructed-response items was on a somewhat 
different set of timelines than the multiple-choice and short constructed-response items. A 
committee of mathematics educators from elementary and secondary schools, colleges, and state 
education agencies met early in 1991, and worked with ETS/NAEP mathematics test 
development staff to develop extended constructed-response items and scoring guides. These 
items were carefully reviewed according to the procedures required by the ETS Standards for 
Quality and Fairness (ETS, 1987), including content and sensitivity reviews. 

Twelve items at each grade level were field tested in May 1991 in urban, suburban, and 
rural school districts in New Jersey and Pennsylvania. Each student was administered two 
extended constructed-response items and each item was given to approximately 50-100 students. 
ETS/NAEP mathematics test development staff scored the extended constructed-response items 
at a special two-day scoring session. Based on the distribution of scores and on the content 
specifications, the final set of extended constructed-response items was selected by ETS/NAEP 
mathematics test development staff and reviewed by the Mathematics Item Development 
Committee. These items were included as the last item in the appropriate blocks. 

Once the total set of items had been selected and assembled into blocks, all items and 
blocks were reviewed again by ETS/NAEP staff for content, measurement, and sensitivity 
concerns. In addition, another meeting of representatives from State Education Agencies was 
convened to review the field test results and final set of items. The federal clearance process 
was initiated in August 1991 with the submission of materials to the National Center for 
Education Statistics. Revisions were made in accordance with changes required by the National 
Center for Education Statistics and NAGB and the final clearance package was approved in 
September 1991. 

The overall pool of items (new and trend items) for the 1992 Trial State Assessment 
consisted of 158 items in grade 4 (54 short and 5 extended constructed-response items, and 99 
multiple-choice items) and 183 items in grade 8 (59 short and 6 extended constructed-response 
items, and 118 multiple-choice items). For each grade, about 40 percent of the assessment time 
(35 percent of the items) was devoted to short and extended constructed-response questions. 



3 In total, there were 13 blocks at each grade level: nine newly created blocks and four trend blocks 
that had been used in the 1990 mathematics assessment. 

Table 2-3 (grade 4) and Table 2-4 (grade 8) provide the number and percentage of items 
for each content and ability group for each grade. These items were also used in the national 
mathematics assessment. 
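
As a quick arithmetic check on the item-pool counts given above, the following Python sketch computes the share of constructed-response items at each grade. The function name `constructed_response_share` is hypothetical and is introduced here only for illustration.

```python
def constructed_response_share(short_cr, extended_cr, multiple_choice):
    """Fraction of items in the pool that are constructed-response items."""
    total = short_cr + extended_cr + multiple_choice
    return (short_cr + extended_cr) / total


# Counts taken from the item-pool description in this section.
grade4_share = constructed_response_share(54, 5, 99)   # 59 of 158 items
grade8_share = constructed_response_share(59, 6, 118)  # 65 of 183 items
```

Both shares fall between 35 and 38 percent, in line with the rounded figure of about 35 percent of the items quoted in the text.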

Students participating in the 1992 Trial State Assessment were also administered a 
special set of items measuring estimation skills. The estimation items were presented to 
students in a separate booklet, accompanied by a paced audiotape, and were the same items 
that were included in the special studies that were part of the 1990 and 1992 national NAEP 
mathematics assessment. The estimation items assessed students’ skills in making estimates 
appropriate to a wide variety of situations. The information from these items supplemented the 
data for the content areas of numbers and operations and measurement. At grade 4, there 
were 20 items measuring estimation skills and at grade 8, there were 22 items. 

Table 2-5 provides the number of questions that were included in the assessment at each 
grade level and Table 2-6 provides the number of questions at each grade level for the 1992 
assessment that were contributed by the 1990 trend blocks (except for the estimation block 
which was an intact block administered as part of a special study in 1990 in the national 
assessment). 



2.7 BLOCK DESIGN 

The assessment included 13 different 15-minute blocks of multiple-choice and 
constructed-response items at each grade level. At each grade level, four blocks used in 1990 
were retained for reassessment in 1992, including one calculator block and the protractor/ruler 
block; nine blocks were newly developed. Of the 13 blocks at each grade level: 

• Three blocks included items designed to be answered using a calculator. For the 
grade 4 calculator blocks, students were provided with a four-function calculator, 
while at grade 8 students were provided with a scientific calculator. 

• One block contained items requiring the use of a ruler at grade 4 and 
protractor/ruler at grade 8. 

• One block contained questions about geometry for which students were given a set 
of geometric shapes to use. 

• Five blocks at grade 4 and six blocks at grade 8 included extended constructed- 
response questions. 

Table 2-7 (grade 4) and Table 2-8 (grade 8) provide the composition of each block of 
items administered in the Trial State Assessment Program. 
Table 2-5 

Total Number of Items in the 1992 Trial State Assessment in Mathematics 

                                          Number of Questions at Each Grade 
Use of Questions                             Grade 4      Grade 8 

Grade 4 and Not Grade 8                         78 
Grade 8 and Not Grade 4                                     103 
Grades 4 and 8                                  80           80 
Total per grade                                158          183 
Number short constructed-response               54           59 
Number extended constructed-response             5            6 
Number multiple-choice                          99          118 
Table 2-6 

Number of Items Contributed by 1990 Blocks in the 1992 Trial State Assessment in Mathematics 

                                          Number of Questions at Each Grade 
Use of Questions                             Grade 4      Grade 8 

Grade 4 and Not Grade 8 
Grade 8 and Not Grade 4 
Grades 4 and 8 
Total per grade                                 57           76 
Number short constructed-response               16           23 
Number multiple-choice                          41           53 
Table 2-7 

Cognitive and Noncognitive Block Composition, Grade 4 

                                                   Total     Number of     Number of Constructed-   Booklets 
                                                   Number    Multiple-     Response Items           Containing 
Block  Type                                        of Items  Choice Items  Short      Extended      Block 

B1     Common Background                              20         20          0           0          1-26 
M2     Mathematics Background                         18         18          0           0          1-26 
MB     Motivation Background                           5          5          0           0          1-26 
M3     Mathematics Cognitive                          13          9          4           0          1, 13, 20 
M4     Mathematics Cognitive (Trend)                  14         14          0           0          1, 2, 21 
M5     Mathematics Cognitive (Trend/Ruler)            17         13          4           0          2, 3, 22 
M6     Mathematics Cognitive (Trend)                  11          0         11           0          3, 4, 23 
M7     Mathematics Cognitive                          10          6          3           1          4, 5, 24 
M8     Mathematics Cognitive (Trend/Calculator)       15         14          1           0          5, 6, 25 
M9     Mathematics Cognitive                          12          9          2           1          6, 7, 26 
M10    Mathematics Cognitive (Manipulatives)           6          0          6           0          4, 7, 21 
M11    Mathematics Cognitive                          16         11          5           0          5, 8, 22 
M12    Mathematics Cognitive (Calculator)             12          5                                 6, 9, 23 
M13    Mathematics Cognitive                          12          6                                 7, 10, 24 
M14    Mathematics Cognitive (Calculator)             10          6          3           1          8, 11, 25 
M15    Mathematics Cognitive                          10          6          3           1          9, 12, 26 



Table 2-8 

Cognitive and Noncognitive Block Information, Grade 8 

                                                   Total     Number of     Number of Constructed-   Booklets 
                                                   Number    Multiple-     Response Items           Containing 
Block  Type                                        of Items  Choice Items  Short      Extended      Block 

B1     Common Background                              22         22          0           0          1-26 
M2     Mathematics Background                         23         23          0           0          1-26 
MB     Motivation Background                           5          5          0           0          1-26 
M3     Mathematics Cognitive                          13          9          3           1          1, 13, 20 
M4     Mathematics Cognitive (Trend)                  21         21          0           0          1, 2, 21 
M5     Mathematics Cognitive (Trend/Ruler)            21         16          5           0          2, 3, 22 
M6     Mathematics Cognitive (Trend)                  16          0         16           0          3, 4, 23 
M7     Mathematics Cognitive                          13          7          5           1          4, 5, 24 
M8     Mathematics Cognitive (Trend/Calculator)       18         16          2           0          5, 6, 25 
M9     Mathematics Cognitive                           9          5          3           1          6, 7, 26 
M10    Mathematics Cognitive (Manipulatives)           7          0          7           0          4, 7, 21 
M11    Mathematics Cognitive                          19         13          6           0          5, 8, 22 
M12    Mathematics Cognitive (Calculator)              9          6          2           1          6, 9, 23 
M13    Mathematics Cognitive                          11          6          4           1          7, 10, 24 
M14    Mathematics Cognitive (Calculator)              9          6          2           1          8, 11, 25 
M15    Mathematics Cognitive                          17         13          4           0          9, 12, 26 
2.8 STUDENT ASSESSMENT BOOKLETS 



The assembly of mathematics items into booklets and their subsequent assignment to 
assessed students was determined by a balanced incomplete block (BIB) design with spiraled 
administration. 

The first step in implementing BIB spiraling required dividing the total pool of 
mathematics items into blocks designed to take 15 minutes to complete. These blocks were then 
assembled into booklets containing two 5-minute background sections, three blocks of 
mathematics items according to a partially balanced incomplete block design, and an additional 
1-minute background section. Thus, the assessment time for each of these student booklets was 
approximately 56 minutes. Following the completion of the assessment booklet, all students 
were given another booklet that contained the paced audiotape block of estimation items which 
took about 15 minutes. Thus, the overall assessment time for each student was approximately 
71 minutes. 

The mathematics blocks were assigned to booklets in such a way that each block 
appeared in the same number of booklets and every pair of blocks appeared together in exactly 
one booklet. This is the balanced part of the balanced incomplete block design. It is an 
incomplete block design because no booklet contained all items and hence there is incomplete 
data for each assessed student. 

The BIB design for the 1992 national mathematics assessment (and, therefore, for the 
Trial State Assessment) was focused: each block was paired with every other mathematics block 
but not with blocks from other subject areas. The focused-BIB design also balances the order of 
presentation of the blocks of items: every block appears as the first cognitive block in one 
booklet, as the second block in another booklet, and as the third block in a third booklet. 
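
The balance properties described above can be checked mechanically. The Python sketch below builds booklets of three blocks from the 13 mathematics blocks (M3 through M15) by cycling two base patterns, then counts how often each pair of blocks shares a booklet and how often each block falls in each position. The base offsets (0, 1, 4) and (0, 2, 7) are an inference from the booklet layout in Table 2-9, not something the report states, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

BLOCKS = [f"M{i}" for i in range(3, 16)]  # the 13 cognitive blocks, M3..M15


def build_booklets(offsets=((0, 1, 4), (0, 2, 7))):
    """Return booklets as ordered triples of block names.

    Each base pattern is rotated through all 13 starting blocks, giving
    13 booklets per pattern (the offsets are an assumption of this sketch).
    """
    n = len(BLOCKS)
    booklets = []
    for pattern in offsets:
        for start in range(n):
            booklets.append(tuple(BLOCKS[(start + d) % n] for d in pattern))
    return booklets


def pair_counts(booklets):
    """How many booklets contain each unordered pair of blocks."""
    return Counter(
        frozenset(pair) for b in booklets for pair in combinations(b, 2)
    )


def position_counts(booklets):
    """How many times each block appears in each position (0, 1, 2)."""
    return Counter((pos, blk) for b in booklets for pos, blk in enumerate(b))


booklets = build_booklets()
pairs = pair_counts(booklets)
positions = position_counts(booklets)
appearances = Counter(blk for b in booklets for blk in b)
```

Under these assumptions every one of the 78 possible block pairs occurs in exactly one booklet, which is the balance condition, and each block appears equally often in the first, second, and third positions.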

The focused-BIB design used at each grade level in 1992 required that 13 blocks of 
mathematics items be assembled into 26 booklets. The assessment booklets were then spiraled 
and bundled. Spiraling involves interleaving the booklets in a systematic sequence so that each 
booklet appears an appropriate number of times in the sample. The bundles were designed so 
that each booklet would appear equally often in each position in a bundle. 

The final step in the BIB-spiraling procedure is the assigning of the booklets to the 
assessed students. The students within an assessment session were assigned booklets in the 
order in which the booklets were bundled. Thus, students in an assessment session received 
different booklets, and only a few students in the session received the same booklet. In the 
Trial State Assessment BIB-spiral design, representative and randomly equivalent samples of 
between approximately 200 and 700 students for each jurisdiction responded to each item. 
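
A simplified picture of spiraling is a round-robin hand-out of booklets within a session. The Python sketch below assigns booklet numbers to students in sequence, starting from a given offset; the function name `spiral_assign` and the plain cyclic order are assumptions for illustration, since the report does not spell out the exact bundling sequence.

```python
from collections import Counter
from itertools import cycle, islice


def spiral_assign(booklet_ids, n_students, start=0):
    """Hand out booklets in a repeating sequence, beginning at index `start`.

    Returns a list whose k-th entry is the booklet given to the k-th student.
    """
    rotated = booklet_ids[start:] + booklet_ids[:start]
    return list(islice(cycle(rotated), n_students))


# Hypothetical session of 30 students and the 26 booklets of the design.
session = spiral_assign(list(range(1, 27)), 30)
usage = Counter(session)
```

In a session of 30 students, the first 26 receive different booklets and only the last four repeat a booklet already handed out, which mirrors the statement that only a few students in a session receive the same booklet.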

Table 2-9 provides the total number of booklets, cognitive blocks, and noncognitive blocks used 
for the program. Table 2-9 also provides the details of the focused-BIB design that was used 
with 13 blocks and 26 booklets. 
Table 2-9 

Booklet Content at Each Grade Level 

           Common        Mathematics                                Motivation 
Booklet    Background    Background                                 Background 
Number     Block         Block          Cognitive Blocks            Block 

   1       B1            M2             M3      M4      M7          MB 
   2       B1            M2             M4      M5      M8          MB 
   3       B1            M2             M5      M6      M9          MB 
   4       B1            M2             M6      M7      M10         MB 
   5       B1            M2             M7      M8      M11         MB 
   6       B1            M2             M8      M9      M12         MB 
   7       B1            M2             M9      M10     M13         MB 
   8       B1            M2             M10     M11     M14         MB 
   9       B1            M2             M11     M12     M15         MB 
  10       B1            M2             M12     M13     M3          MB 
  11       B1            M2             M13     M14     M4          MB 
  12       B1            M2             M14     M15     M5          MB 
  13       B1            M2             M15     M3      M6          MB 
  14       B1            M2             M3      M5      M10         MB 
  15       B1            M2             M4      M6      M11         MB 
  16       B1            M2             M5      M7      M12         MB 
  17       B1            M2             M6      M8      M13         MB 
  18       B1            M2             M7      M9      M14         MB 
  19       B1            M2             M8      M10     M15         MB 
  20       B1            M2             M9      M11     M3          MB 
  21       B1            M2             M10     M12     M4          MB 
  22       B1            M2             M11     M13     M5          MB 
  23       B1            M2             M12     M14     M6          MB 
  24       B1            M2             M13     M15     M7          MB 
  25       B1            M2             M14     M3      M8          MB 
  26       B1            M2             M15     M4      M9          MB 

Blocks M4, M5, M6, and M8 are trend blocks from 1990 
Block M5 requires a protractor/ruler (grade 8) or ruler (grade 4) 
Blocks M8, M12, and M14 require a calculator 
Block M10 requires geometric shapes/manipulatives 



2.9 QUESTIONNAIRES 

As part of the Trial State Assessment (as well as the national assessment), a series of 
background questionnaires was administered to students, teachers, and school administrators. 
Similar to the development of the cognitive items, the development of the policy issues and 
questionnaire items was an iterative process that involved staff work, field testing, and review by 
external advisory groups and the federal government. A Background Panel drafted a set of 
policy issues and made recommendations regarding the design of the questions. They were 
particularly interested in capitalizing on the unique properties of NAEP and not duplicating 
other surveys (e.g., The National Survey of Public and Private School Teachers and 
Administrators, The School and Staffing Study, and The National Educational Longitudinal 
Study). 



The Panel recommended a focused study that addressed the relationship between 
student achievement and instructional practices. For the 1992 assessment, the framework 
focused on five educational areas: instructional content, instructional practices and experiences, 
teacher characteristics, school conditions and context, and conditions beyond school (i.e., home 
support, out-of-school activities, and attitudes) (NAEP, 1992). The items were written by ETS 
staff and reviewed by the Background Panel, representatives from State Education Agencies, the 
National Center for Education Statistics, and the Office of Management and Budget. The 
items were assembled into questionnaires and underwent internal ETS review 
procedures to ensure fairness and quality. They were field tested as part of the February 1991 
field test and reviewed again by the Background Panel, representatives from State Education 
Agencies, the National Center for Education Statistics, and the Office of Management and 
Budget. 



2.9.1 Student Questionnaires 

In addition to the cognitive questions, the 1992 Trial State Assessment included two five- 
minute sets of general and mathematics background questions designed to gather contextual 
information about students, their experiences in mathematics, and their perceptions of the 
subject, and a one-minute set of background questions about the students’ motivation regarding 
the assessment. In many cases the questions used were continued from prior assessments, 
especially from the 1990 assessment in order to measure change between 1990 and 1992. 

The student demographics (common core) questionnaire (20 questions at grade 4 and 22 
questions at grade 8) included questions about race/ethnicity, language spoken in the home, 
mother’s and father’s level of education, reading materials in the home, television watching, 
homework, and which parents live at home. This questionnaire was the first section in every 
booklet. 

Three categories of information were represented in the second five-minute student 
mathematics questionnaire (18 questions at grade 4 and 23 questions at grade 8): time spent on 
task and mathematics coursework, the nature of students’ mathematics instruction, and students’ 
enjoyment of and confidence in their abilities in mathematics and their perceptions of the 
usefulness of the discipline to their present and future lives. This questionnaire was the second 
section in every booklet. 

The motivation questionnaire (5 questions at each grade level) asked the students 
questions about their perceptions of the difficulty of the assessment, and of how well they did on 
the assessment, and their motivation to do well on the assessment. This questionnaire was the 
last section in every booklet. 
2.9.2 Teacher, School, and Excluded Student Questionnaires 

To supplement the information on instruction reported by students, the mathematics 
teachers of the fourth- and eighth-grade students participating in the Trial State Assessment 
were asked to complete a mathematics teacher questionnaire about their instructional practices, 
teaching backgrounds, and characteristics. The teacher questionnaires contained two parts. 4 
The Teacher Questionnaire, Part I: Background and Training (23 questions at grade 4 and 32 
questions at grade 8) included questions pertaining to gender, race/ethnicity, years of teaching 
experience, certification, degrees, major and minor fields of study, coursework in education, 
coursework in subject area, in-service training, extent of control over classroom, instruction, and 
curriculum, and availability of resources for their classroom. The Teacher Questionnaire, Part 
II: Class by Class Mathematics Information (40 questions at grade 4 and 42 questions at grade 
8) pertained to the procedures the teacher uses for each class containing an assessed student 
and included questions on the ability level of students in the class, whether students were 
assigned to the class by ability level, time on task, homework assignments, frequency of 
instructional activities used in class, instructional emphasis given to the topics and skills covered 
in the assessment, and use of particular resources. 

A School Characteristics and Policies Questionnaire was given to the principal or other 
administrator of each school that participated in the Trial State Assessment Program. This 
questionnaire (77 questions at both grades 4 and 8) included questions about background and 
characteristics of school principals, length of school day and year, school enrollment, 
absenteeism, drop-out rates, policies about tracking, curriculum, testing practices and use, special 
priorities and school-wide programs, availability of resources, special services, community 
services, policies for parental involvement, and school-wide problems. 

The Excluded Student Questionnaire was completed by the teachers of those students 
who were selected to participate in the Trial State Assessment sample but who were determined 
by the school to be ineligible to be assessed because they either had an Individualized Education 
Plan (IEP) and were not mainstreamed at least 50 percent of the time, or were categorized as 
Limited English Proficient (LEP). This questionnaire asked about the nature of the student’s 
exclusion and the special programs in which the student participated. 

Schools were permitted to exclude certain students from the assessment. The same 
exclusion criteria and rules used in the national assessment were also applied to the Trial State 
Assessment. Although the intent was to assess all sampled students, students who were 
identified by school staff as not capable of participating meaningfully were excluded. The NAEP 
guidelines for exclusion are intended to assure uniformity of exclusion criteria from school to 
school as well as from state to state. 



4 Because the Trial State Assessment at grade four included both mathematics and reading, the fourth grade teacher 
questionnaire contained three sections. The first asked about the teachers’ background and training, the second asked 
about classroom information for the mathematics teachers of the students involved in the mathematics assessment, and 
the third asked about classroom information for the reading teachers of the students involved in the reading assessment. 
Mathematics teachers of students participating in the mathematics assessment were asked to complete parts one and two, 
only. 
Chapter 3 



SAMPLE DESIGN AND SELECTION 

Leyla K. Mohadjer, Keith F. Rust, Valerija Smith, and Jacqueline Severynse 

Westat, Inc. 



3.1 Introduction and Overview 

The 1992 Trial State Assessment Program included assessments in eighth-grade 
mathematics, fourth-grade mathematics, and fourth-grade reading. Three representative 
samples of public-school students were drawn in each participating state or territory. Each 
sample was designed to produce aggregate estimates as well as estimates for various 
subpopulations with approximately equal precision for the participating states. The sample for 
the eighth-grade assessment of mathematics consisted of about 2,500 eighth-grade students from 
about 100 public schools in each state or territory. Similarly, the samples for the fourth-grade 
assessments in each state consisted of about 2,500 fourth-graders in mathematics and about 
2,500 in reading, from about 100 public schools in each case. 

The target populations for the 1992 Trial State Assessment Program included only 
students in regular public schools 1 who were enrolled in the fourth or eighth grade at the time 
of assessment. The sampling frame included the public schools having the relevant grade 
(fourth or eighth grade) in each state or territory. The samples were selected based on a two- 
stage sample design: selection of schools within participating states and selection of students 
within schools. The first-stage samples of schools were selected with probability proportional to 
the eighth- or fourth-grade enrollment in the schools to provide efficient sample designs for the 
student populations. Special procedures were used for states with many small schools, and for 
states or territories having a small number of schools for a given grade (see section 3.4.5). 

The sampling frame for each state was first stratified by the urbanization status of the 
area in which the school was located. The urbanization classes were defined in terms of large or 
mid-size central city, urban fringe of large or mid-size city, large town, small town, and rural 
areas (see section 3.4.2). Within urbanization strata, schools were further stratified explicitly on 
the basis of minority enrollment in those states with substantial Black or Hispanic student 
populations. Minority enrollment was defined as the total percent of Black and Hispanic 
students enrolled in a school (see section 3.4.3). Within minority strata, schools were sorted by 
median household income of the ZIP code area where the school was located (see section 3.4.4). 



1 A public school is defined as an institution which provides educational services and has one or more grade groups 
(PK-12) or which is ungraded, has one or more teachers to give instruction, is located in one or more buildings, has an 
assigned administrator, receives public funds as primary support, and is operated by an education agency. A regular 
school is a public elementary/secondary school that does not focus primarily on vocational, special, or alternative 
education. 


One of the goals of the 1992 state sample design was to minimize overlap between the 
state and national samples, between the state fourth- and eighth-grade samples (in schools that 
had both grades), and with the first-phase followup to Prospects: The National Longitudinal Study 
of Chapter 1 Children (Abt Associates, 1991). 

A systematic random sample of about 100 eighth-grade schools was drawn with 
probability proportional to the eighth-grade enrollment of the school from the stratified frame 
of schools within each state. Up to three sessions were assigned within each school. The 
number of sessions selected in each school was proportional to the eighth-grade enrollment of 
the schools. In those states and territories that had fewer than 100 schools with eighth grade, all 
schools were included in the sample. 
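
The systematic selection with probability proportional to enrollment described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually programmed for the assessment: the function name `pps_systematic_sample`, the `(school_id, enrollment)` frame layout, and the handling of schools larger than the sampling interval are all assumptions made for the example.

```python
import random


def pps_systematic_sample(frame, n, rng=None):
    """Systematic selection with probability proportional to size.

    frame: list of (school_id, enrollment) pairs, already sorted in
           stratification order (hypothetical layout).
    n:     target number of selections (about 100 in the text).
    """
    rng = rng or random.Random()
    if len(frame) <= n:
        # Jurisdictions with too few schools take every school.
        return [school_id for school_id, _ in frame]

    total = sum(size for _, size in frame)
    interval = total / n
    start = rng.random() * interval          # random start in [0, interval)
    targets = [start + k * interval for k in range(n)]

    selected, cumulative, t = [], 0.0, 0
    for school_id, size in frame:
        cumulative += size
        # A school is selected each time a target falls inside its
        # cumulative-enrollment range. Repeated hits on one very large
        # school are kept as duplicates in this sketch (an assumption).
        while t < n and targets[t] < cumulative:
            selected.append(school_id)
            t += 1
    return selected
```

Because the frame is walked in its sorted order, any ordering imposed beforehand (for example by stratum) carries through to the sample as an even spread across the list.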

Similarly, systematic random samples of fourth-grade schools were selected with 
probability proportional to the fourth-grade enrollment of the school from the fourth-grade 
sampling frames in the participating states. The number of schools drawn for the fourth-grade 
sample varied by state depending on the distribution of the fourth-grade enrollment in each 
state (see Table 3-3). In those states and territories that had fewer than 100 schools with fourth 
grade, all schools were included in the sample. 

Successive schools were paired, using the same order in which they were selected, and 
one member of each pair was designated at random to be monitored during the assessment by 
Westat field staff so that reliable comparisons could be made between sessions administered 
with and without monitoring. 
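
A minimal sketch of the pairing step, assuming the selected schools are held in a list in selection order. The name `assign_monitoring` is hypothetical, and the coin flip used for a leftover unpaired school is an assumption, since the text does not say how an odd count was handled.

```python
import random


def assign_monitoring(selected_schools, rng=None):
    """Pair successive schools and mark one member of each pair as monitored.

    Returns a dict mapping school -> True (monitored) / False (unmonitored).
    """
    rng = rng or random.Random()
    monitored = {}
    for i in range(0, len(selected_schools), 2):
        pair = selected_schools[i:i + 2]
        if len(pair) == 1:
            # Assumption: a leftover school is monitored with probability 1/2.
            monitored[pair[0]] = rng.random() < 0.5
            continue
        chosen = rng.choice(pair)
        for school in pair:
            monitored[school] = (school == chosen)
    return monitored
```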

Both reading and mathematics sessions were conducted in fourth-grade sampled schools 
in which there were more than 20 students. Schools that had no more than 20 fourth-grade 
students were randomly assigned to administer either reading or mathematics. Approximately 
2,500 students were assessed for each subject and each grade in a given state. Except in the two 
small territories, about 5,000 fourth-grade students participated in the assessment. On average, 
128 fourth-grade schools were sampled in each state (in which sampling of schools was 
conducted) with about 115 conducting both mathematics and reading assessments, and about 13 
conducting only mathematics or reading. The maximum number of schools selected in a state 
was 200. 

Each selected school provided a list of eligible enrolled students, from which a systematic 
sample of students was drawn. Thirty students were selected for each session from grade 8 
student lists, and 60 students were selected from grade 4 student lists. All students were 
selected if there were fewer than 30 grade 8 or fewer than 60 grade 4 students on the lists. Selected 
students within each of the fourth-grade schools were alternately assigned to either the 
mathematics or the reading assessments. 
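
The within-school step can be illustrated with the sketch below, which handles a single session. The function name `sample_students`, the fractional-interval form of systematic sampling, and the choice to start the grade 4 alternation with mathematics are assumptions for illustration only.

```python
import random


def sample_students(student_list, grade, rng=None):
    """Systematic sample from one school's list of eligible students.

    Returns (student, subject) pairs; subject is None at grade 8, where
    students are not split between subjects in this sketch.
    """
    rng = rng or random.Random()
    target = 30 if grade == 8 else 60
    n = len(student_list)
    if n <= target:
        chosen = list(student_list)          # short lists: take everyone
    else:
        interval = n / target                # greater than 1, so no repeats
        start = rng.random() * interval
        chosen = [
            student_list[min(int(start + k * interval), n - 1)]
            for k in range(target)
        ]
    if grade == 4:
        # Alternate assignment; which subject comes first is an assumption.
        subjects = ("mathematics", "reading")
        return [(s, subjects[i % 2]) for i, s in enumerate(chosen)]
    return [(s, None) for s in chosen]
```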

The 1992 assessment was preceded in 1991 by a field test, the principal goals of which 
were to test procedures and new items contemplated for the 1992 assessment. Three states and 
one territory also used the field test to observe and react to proposed strategies. Twenty-four 
jurisdictions participated in the field test. Schools that participated in the field test were given a 
chance of selection in the 1992 assessment, and there was no attempt to control the overlap 
between the school samples for the 1991 field test and those for the 1992 assessment. Section 
3.2 documents the procedures used to select the schools for the field test. 

Section 3.3 describes the construction of the sampling frames, including the sources of 
school data, missing data problems, and definition of in-scope schools. Section 3.4 includes a 
description of the various steps in stratification of schools within participating states. School 
sample selection procedures (including new and substitute schools) are described in section 3.5. 
Section 3.6 includes the steps involved in selection of students within participating schools. 



3.2 SAMPLE SELECTION FOR THE 1991 FIELD TEST 

The Trial State Assessment 1991 field test was conducted together with the field test for 
the national portion of the assessment. Twenty-four states participated in the field test, which 
was conducted for grades 4, 8, and 12. Pairs of schools were identified, with one of each pair to 
be included in the test. This allowed state participation in the selection of the test schools and 
also facilitated replacement of schools that declined to participate in the assessment. Sampling 
weights were not computed for the field test samples. 



3.2.1 Primary Sampling Units 

The frame of field-test PSUs was derived from the frame of NAEP PSUs 2, splitting 
PSUs where necessary in such a way that each of the new PSUs was completely contained within 
a single state. Each state was stratified by urbanization/minority. The sample sizes were 
assigned in such a way that for each NAEP region the sample sizes were proportional to the 
population of the participating states. Two PSUs were selected from each state. From each of 
the state strata, once the sample was assigned, the PSUs were selected with probability 
proportional to the 1980 population counts. The PSUs selected as noncertainties in the NAEP 
1990 national sample were excluded from the PSU frame to avoid undue burden on the schools 
and districts in these PSUs. Controlled selection (Kish, 1965, pp. 488-95) of PSUs was used to 
achieve the selection of two PSUs per state, assigned proportionately among strata within each 
region. 



Since two PSUs were selected for each of the participating states, the sample assignment 
was not proportional to the population counts. Overall, within each region, the assignment of 
PSUs was proportional to the urbanization/minority stratum population in each region, where 
the urbanization/minority stratum population distribution was based only upon the participating 
states, with each state contributing equally. So, for example, the rural population had 
disproportionately higher representation in the field test than in the general population, since 
many of the participating states were relatively rural in nature. 



2 The frame of NAEP PSUs was the frame used to draw the national NAEP samples for 1986 to 1992. Refer to the 
1990 national technical report (Johnson & Allen, 1992) for more information. 
3.2.2 Selection of Schools and Students 

Public schools with fourth- and eighth-grade students were in scope for the assessment. 
Schools with fewer than 25 students per grade were removed from the frame to avoid the 
relatively high cost per student of conducting assessments in small schools. 

The selection of schools avoided overlap with schools that had been selected from the 
certainty PSUs for the 1990 NAEP national sample and the IEA Reading Literacy Study, 
conducted for the National Center for Education Statistics (Rust & Bryant, 1991). Also, there 
was no overlap among the different grade samples. 

For each grade, from each PSU, a sample of five schools was selected with probability 
proportional to the grade enrollment. In the states where one PSU had fewer than five schools, 
the sample from the other PSU in the state was increased so that the overall state sample was 
still 10 schools per grade. For each school, where the size of the PSU allowed, we selected the 
second member of each pair in such a way that the "distance" from the primary selection, based 
on percent of Black students, percent of Hispanic students, grade enrollment, and percent of 
students living below the poverty line, was the smallest. The overlap of samples was avoided by 
first selecting the twelfth-grade sample (for the national NAEP field test), then eliminating the 
selected schools from the eighth-grade sample selection, and then eliminating the twelfth- and 
eighth-grade selections before selecting the fourth-grade sample. 
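
The report does not give the formula for the "distance" used to choose the second member of each pair. The sketch below assumes a range-scaled Euclidean distance over the four named variables, purely for illustration; the field names and the function `closest_match` are hypothetical.

```python
import math

# Hypothetical field names for the four matching variables named in the text.
FIELDS = ("pct_black", "pct_hispanic", "grade_enrollment", "pct_poverty")


def closest_match(primary, candidates):
    """Return the candidate school nearest to the primary selection.

    Each school is a dict holding the four FIELDS. Every variable is
    divided by its range over the pool so that enrollment counts do not
    swamp the percentages (an assumed scaling, not taken from the report).
    """
    pool = [primary] + list(candidates)
    scales = {}
    for f in FIELDS:
        values = [s[f] for s in pool]
        spread = max(values) - min(values)
        scales[f] = spread if spread > 0 else 1.0

    def distance(a, b):
        return math.sqrt(sum(((a[f] - b[f]) / scales[f]) ** 2 for f in FIELDS))

    return min(candidates, key=lambda s: distance(primary, s))
```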

Among the 10 schools selected for the eighth-grade sample, two classrooms were 
randomly selected from each of the five largest schools and one from each of the remaining 
schools. In the fourth-grade sample, two classrooms were selected randomly from each of the 
three largest schools and one classroom from each of the remaining seven schools. An 
exception was made in the fourth-grade samples in Florida, Kentucky, and Wisconsin, where 50 
students were sampled from each of the three largest schools and 25 students from each of the 
remaining schools (unless the number of students was fewer than 35, in which case all of 
them were taken in the sample). These three states wished to try out the student sampling 
procedures proposed for the 1992 assessment, and so did not use samples of intact classrooms. 



3.2.3 Assignment to Sessions for Different Subjects 

Three types of sessions were assigned for the field test: print-administered mathematics, 
audiotape-administered mathematics, and print-administered reading and writing. At grade 4, 
one classroom (session) per PSU was selected with equal probability to be administered the 
print-administered mathematics assessment in all states but Florida, Kentucky, and Wisconsin, 
for a total of 61 such sessions. The remaining 228 sessions were assigned to reading and 
writing, from which 15 sessions were selected for audiotaped mathematics sessions with equal 
probability after implicitly stratifying by geographic and urbanization/minority characteristics. In 
Florida, Kentucky, and Wisconsin, where samples of 50 students were drawn from the selected 
schools, the sample was randomly split in two equal sessions. Half of the sessions were 
randomly assigned to the print-administered mathematics assessment and the rest to the reading 
assessment. Florida, Kentucky, Wisconsin, and the Virgin Islands did not participate in any 
audiotaped mathematics sessions or writing sessions, since those two components were not 
planned to be part of the 1992 Trial State Assessment. 
At grade 8, 50 sessions (classrooms), two per state, were selected to conduct the 
print-administered mathematics assessment. The remaining 267 sessions were assigned to reading 
and writing, from which 13 sessions were assigned to tape-administered mathematics, 
subsampled with equal probability after implicitly stratifying by geographic and 
urbanization/minority characteristics. The Virgin Islands was the only jurisdiction at grade 8 
that did not participate in the assignment of audiotaped sessions. 



3.3 SAMPLING FRAME FOR THE 1992 ASSESSMENT 
3.3.1 Choice of School Sampling Frame 

In order to draw the school samples for the 1992 Trial State Assessment, it was 
necessary to obtain a comprehensive list of public schools in each state. For each school, we 
needed useful information for stratification purposes, reliable information about grade span and 
enrollment, and accurate information for identifying the school to the state coordinator (district 
membership, name, address). 

Based on our experience with the 1990 Trial State Assessment, and national assessments 
in 1984, 1986, 1988, and 1990, we elected to use the file made available by Quality Education 
Data, Inc. (QED). We used the National Center for Education Statistics' Common Core of 
Data (CCD) school file to check the completeness of the QED file. This approach differed 
from that used to develop frames for the 1990 Trial State Assessment, for which the CCD was 
used primarily. There were several reasons for this change. 

For 1992, it was possible to obtain a version of the QED file that contained all of the 
relevant variables from the most current CCD file. This meant in particular that data on 
minority enrollment by school, an important school stratification variable, were available on the 
QED file. These data had been available only on the CCD for the 1990 assessment. In 
addition, "type of locale," a seven-level urbanization variable newly created by the National 
Center for Education Statistics, was available on the QED (as well as the CCD) for 1992. Our 
experience in 1990 indicated that, generally speaking, the currency of the school lists and the 
quality of name and address information were both higher overall and more uniform on the 
QED. This is important for three reasons: 1) an outdated list leads to the selection of 
relatively many out-of-scope schools and greater reliance on new school sampling procedures; 2) 
poor quality name and address information leads to errors in the identification of sample 
schools by state coordinators (some schools on the CCD in 1990 had no city name as part of the 
address, for example); and 3) good quality ZIP codes are needed to give good stratification by 
household income (see section 3.4.4). 

Thus, the combination of these factors led us to choose the QED file as the basis of the 
frame for each state. The QED list covers all states and territories except Puerto Rico (which 
did not participate). The version of the QED file used was released in late 1990, in time for 
selection of the school sample in early 1991. The file was missing minority and urbanization 
data for a sizable minority of schools (due to the inability of QED to match these schools with 
the corresponding CCD file). We undertook considerable efforts to obtain these variables for 
all schools in states where these variables were to be used for stratification. These efforts are 
described in the next section. 



Tables 3-1 and 3-2 show the distribution of fourth- and eighth-grade schools, and 
enrollment within schools as reported in the 1990 QED file. Enrollment was estimated for each 
grade as the ratio of total school enrollment to the number of grades in the school. Refer to 
section 3.4.5 for the definition of small school cluster type. Schools with fewer than 20 students 
in the associated grade are denoted as small schools. Large schools are those with 20 or more 
students in the associated grade. 
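
As a small worked illustration of the enrollment estimate and the small/large cutoff, the sketch below uses hypothetical function names; only the division and the threshold of 20 come from the text.

```python
def estimate_grade_enrollment(total_enrollment, n_grades):
    """Estimated enrollment for one grade: total enrollment / number of grades."""
    if n_grades <= 0:
        raise ValueError("a school must have at least one grade")
    return total_enrollment / n_grades


def size_class(grade_enrollment):
    """Small schools have fewer than 20 students in the grade; others are large."""
    return "small" if grade_enrollment < 20 else "large"
```

For example, a K-5 school (six grades) with 114 students has an estimated 19 students per grade and is classed as small, while the same school with 120 students has an estimated 20 per grade and is classed as large.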



3.3.2 Missing Minority and Urbanization (Type of Locale) Data 

As stated earlier, the sampling frame for the 1992 Trial State Assessment was the most 
recent version of the QED file, as of January 1991. The CCD file was used to extract 
information on minority and urbanization in the cases where these variables were missing on the 
QED file. The minority data were extracted only for those schools in states in which minority 
stratification was performed. In cases where urbanization could not be determined from the 
CCD file, the three-level classification of urban/suburban/rural (available for all schools on the 
QED file) was used to impute for urbanization. 



3.3.3 In-scope Schools 

The target population for the 1992 Trial State Assessment Program included students in 
regular public schools who were enrolled in the eighth grade or fourth grade. Parochial, private, 
Bureau of Indian Affairs, Department of Defense, and special education schools were not 
included. 



3.4 WITHIN-STATE STRATIFICATION 
3.4.1 Stratification Variables 

Selection of schools within participating states involved three stages of explicit 
stratification and one stage of implicit stratification. The first three stages were school size 
(where size was the grade level enrollment of the schools), urbanization, and minority 
enrollment. The final stage was median income. The stratification methods described below 
applied to both fourth- and eighth-grade. 
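
One common way to combine explicit and implicit stratification is to sort the frame so that explicit strata are contiguous and schools within each stratum are ordered by the implicit variable before a systematic sample is drawn. The sketch below assumes that approach; the dictionary keys, the function name `order_frame`, and the ascending income order are assumptions, not details taken from the report.

```python
def order_frame(schools):
    """Sort a state's frame for stratified systematic sampling.

    Explicit strata (size, urbanization, minority) become contiguous
    blocks; median income orders schools within each block.
    """
    return sorted(
        schools,
        key=lambda s: (
            s["size_stratum"],      # used only in states with many small schools
            s["urbanization"],
            s["minority_stratum"],
            s["median_income"],     # ascending order is an assumption
        ),
    )
```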

The first stage of stratification applied only to states with relatively many students in 
small schools. These states were known as Cluster Type 3 states. The schools were stratified 
into two strata, one stratum consisting of schools with 20 or more fourth-grade (or eighth-grade) 
students, and another stratum consisting of all schools with fewer than 20 students in the fourth 
(or eighth) grade. The primary purpose of this stratification was to ensure that the sample of 
schools would provide an appropriate student sample size. It also ensured appropriate 
Table 3-1 

Distribution of Fourth-grade Schools and Enrollment as Reported in QED 1990 

                      Small School      Total    Small    Large       Total  Small School 
State                 Cluster Type    Schools  Schools  Schools  Enrollment    Enrollment 

Alabama               Geographic          786       29      757       59427           438 
Arizona               Geographic          637       57      580       51361           505 
Arkansas              Geographic          550       40      510      35,107           606 
California            Geographic        4,610      299    4,311      383365          2330 
Colorado              Geographic          752       84      668       45345           860 
Connecticut           Geographic          563       11      552      37,069           163 
Delaware              None                 54        2       52        6342            32 
District of Columbia  None                118        3      115        6306            34 
Florida               Geographic        1,321       13    1,308     144,789           191 
Georgia               Geographic        1,021       11    1,010      94,572           178 
Guam                  None                 21        0       21       2,115             0 
Hawaii                Geographic          170        3      167      14,070            21 
Idaho                 Stratified          304       44      258      18,069           385 
Indiana               Geographic        1,167       21    1,146      75,807           339 
Iowa                  Stratified          794       84      710      37,786          1336 
Kentucky              Geographic          832       51      781       50356           753 
Louisiana             Geographic          788       44      744      62,780           627 
Maine                 Stratified          405      122      283      16,616          1358 
Maryland              Geographic          755       12      743       54316           155 
Massachusetts         Geographic        1,038       28    1,010       64374           390 
Michigan              Geographic        1,876       62    1,814     123,028           S71 
Minnesota             Geographic          838       66      772      58,711           956 
Mississippi           Geographic          465        3      462      41,063            46 
Missouri              Stratified        1,093      147      946       63355         1,728 
Nebraska              Stratified        1,011      615      396      21,834          3326 
New Hampshire         Stratified          268       55      213      13,721           654 
New Jersey            Geographic         1338       42     1396      84,148           639 
New Mexico            Stratified          378       57      321       24316           673 
New York              Geographic        2,359       44    2,315      191373           565 
North Carolina        Geographic        1,109       25    1,084      85,158           361 
North Dakota          Stratified          359      180      179       9,973          1328 
Ohio                  Geographic        2,039       44    1,995     136,626           651 
Oklahoma              Stratified          973      216      757       48317         2,696 
Pennsylvania          Geographic        1,879       47    1,832     126,166           727 
Rhode Island          Geographic          177        2      175        114M            28 
South Carolina        Geographic          552        4      548      49,117            50 
Tennessee             Geographic          933       66      867      66,932           900 
Texas                 Geographic        3,053      238    2,815     268,796          2396 
Utah                  Geographic          432       31      401      36,629           260 
Virginia              Geographic        1,041       39    1,002       80386           523 
Virgin Islands        None                 24        1       23        1374            15 
West Virginia         Stratified          637      104      533       25332         1,474 
Wisconsin             Stratified        1,147      128    1,019      59,965         1,910 
Wyoming               Stratified          238       91      147       8,050           528 
Table 3-2 

Distribution of Eighth-grade Schools and Enrollment as Reported in QED 1990 

                      Small School      Total    Small    Large       Total  Small School 
State                 Cluster Type    Schools  Schools  Schools  Enrollment    Enrollment 

Alabama               Geographic          497       15      482      55,735           231 
Arizona               Geographic          303       42      261      44,533           349 
Arkansas              Geographic          358       32      326      34,237           461 
California            Geographic        1,594      214    1,380     330,433         2,023 
Colorado              Geographic          319       62      257      40,763           637 
Connecticut           Geographic          214        6      208      31,483            63 
Delaware              None                 28        2       26       6,482            32 
District of Columbia  None                 35        0       35       5,361             0 
Florida               Geographic          442        7      435     125,900            84 
Georgia               Geographic          393        1      392      86,778            10 
Guam                  None                  6        0        6       1,862             0 
Hawaii                None                 57        2       55      12,053            21 
Idaho                 Geographic          154       29      125      16,243           265 
Indiana               Geographic          444        2      442       72326            27 
Iowa                  Geographic          455       37      418      36,272           467 
Kentucky              Geographic          432       34      398      47,605           501 
Louisiana             Geographic          440       45      395      57,168           648 
Maine                 Stratified          235       68      167      15,713           713 
Maryland              Geographic          216        4      212      48,408            41 
Massachusetts         Geographic          385        4      381      58,519            22 
Michigan              Geographic          748       42      706     113,633           331 
Minnesota             Geographic          441       30      411      53,079           444 
Mississippi           Geographic          308        2      306      37,965            31 
Missouri              Stratified          640      131      509      58,673          1319 
Nebraska              Stratified          706      502      204      19,986         2,636 
New Hampshire         Geographic          134       16      118      12,787           184 
New Jersey            Geographic          678       23      655       81352           332 
New Mexico            Geographic          150       28      122      21,111           310 
New York              Geographic          998       15      983     185,484           196 
North Carolina        Geographic          556       17      539      84,003           225 
North Dakota          Stratified          272      167      105       8,809          1355 
Ohio                  Geographic          846       16      830     129,321           180 
Oklahoma              Stratified          642      207      435      44,121          2359 
Pennsylvania          Geographic          722        2      720     122,456            19 
Rhode Island          None                 54        1       53       9,765             9 
South Carolina        Geographic          256        1      255      47,670            12 
Tennessee             Geographic          549       47      502      63,532           611 
Texas                 Geographic        1,464      210    1,254      242358          2367 
Utah                  Geographic          142       17      125       31340           162 
Virginia              Geographic          335        5      330       72407            67 
Virgin Islands        None                  6        0        6       1,960             0 
West Virginia         Geographic          252       14      238       25206           204 
Wisconsin             Geographic          517       42      475      54,906           605 
Wyoming               Stratified          108       47       61        7358           267 
representation of small schools in states with any substantial number of such schools. Tables 3-3 
and 3-4 provide the type of stratification used in each of the participating states or territories 
respectively for fourth- and eighth-grade samples. Refer to section 3.4.2 for the definition of 
urbanization and section 3.4.3 for the definition of minority. 



3.4.2 Urbanization Classification 

The NCES "type of locale" variable was used to stratify schools into different 
urbanization levels. Stratification by type of locale was repeated separately for fourth and eighth 
grade. The seven categories of the variable are defined as follows. 

1) Large Central City: a central city of a Metropolitan Statistical Area (MSA) with a 
population greater than or equal to 400,000, or a population density greater than 
or equal to 6,000 persons per square mile. 

2) Mid-size Central City: a central city of an MSA but not designated as a large 
central city. 

3) Urban Fringe of Large City: a place within an MSA of a large central city and 
defined as urban by the U.S. Bureau of the Census. 

4) Urban Fringe of Mid-size City: a place within an MSA of a mid-size central city 
and defined as urban by the U.S. Bureau of the Census. 

5) Large Town: a place not within an MSA, but with a population greater than or 
equal to 25,000 and defined as urban by the U.S. Bureau of the Census. 

6) Small Town: a place not within an MSA, but with a population less than 25,000 
and defined as urban by the U.S. Bureau of the Census. 

7) Rural: a place with a population of less than 2,500 and defined as rural by the 
U.S. Bureau of the Census. 
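
The seven definitions above can be read as a decision rule. The sketch below encodes them literally; the `Place` record and its field names are hypothetical, and combinations that the definitions do not cover raise an error rather than being guessed.

```python
from dataclasses import dataclass


@dataclass
class Place:
    in_msa: bool                       # located within an MSA
    is_central_city: bool              # is a central city of that MSA
    msa_has_large_central_city: bool   # the MSA's central city is "large"
    population: int
    density_per_sq_mile: float
    census_urban: bool                 # defined as urban by the Census Bureau


def type_of_locale(p):
    """Return the locale category (1 to 7) for a place."""
    if p.in_msa and p.is_central_city:
        if p.population >= 400_000 or p.density_per_sq_mile >= 6_000:
            return 1                   # Large Central City
        return 2                       # Mid-size Central City
    if p.in_msa and p.census_urban:
        return 3 if p.msa_has_large_central_city else 4   # urban fringe
    if not p.in_msa and p.census_urban:
        return 5 if p.population >= 25_000 else 6         # large / small town
    if not p.census_urban and p.population < 2_500:
        return 7                       # Rural
    raise ValueError("combination not covered by the seven definitions")
```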

The urbanization strata were created by collapsing type of locale categories. The nature 
of the collapsing varied across states and grades. At a minimum, each urbanization stratum 
included 10 percent of eligible students in the participating state. Tables 3-3 and 3-4 provide the 
urbanization categories (created by collapsing type of locale) used within each state. 
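
The report states only that collapsing varied by state and grade and that each resulting stratum held at least 10 percent of eligible students. The sketch below shows one possible greedy rule that merges neighboring categories until the minimum is met; it is an illustrative assumption, not the rule used for any particular state, and the function name `collapse_locales` is hypothetical.

```python
def collapse_locales(shares, minimum=0.10):
    """Merge adjacent locale categories until each stratum meets the minimum.

    shares: list of (label, share_of_eligible_students) in locale order.
    Returns a list of (labels, combined_share) tuples.
    """
    strata = []
    labels, share = [], 0.0
    for label, s in shares:
        labels.append(label)
        share += s
        if share >= minimum:
            strata.append((labels, share))
            labels, share = [], 0.0
    if labels:
        # A trailing group below the minimum is folded into the previous one.
        if strata:
            prev_labels, prev_share = strata.pop()
            strata.append((prev_labels + labels, prev_share + share))
        else:
            strata.append((labels, share))
    return strata
```

With hypothetical shares of 4, 20, 3, 15, 5, 25, and 28 percent across the seven categories, this rule yields four strata: the two central-city categories, the two urban-fringe categories, large and small towns together, and rural on its own.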



3.4.3 Minority Classification 

The third stage of stratification was minority enrollment. Minority enrollment strata 
were formed within urbanization strata, based on the percentages of Black and Hispanic 
students. The three cases that occur are described in the following paragraphs. 

Case 1: Urbanization strata with less than 10 percent Black students and less than 7 percent 
Hispanic students were not stratified by minority enrollment. 
Table 3-3 

Distribution of the Selected Schools by Sampling Strata, Grade 4 

Urbanization                                  Minority                  Schools in Strata 

ALABAMA (Small School Cluster Type 2 - Geographic) 
Mid-size Central City                         Low Percent Minority           9 
Mid-size Central City                         Medium Percent Minority        9 
Mid-size Central City                         High Percent Minority          8 
Urban Fringe of Mid-size Central City         Low Percent Minority          10 
Urban Fringe of Mid-size Central City         Medium Percent Minority       10 
Urban Fringe of Mid-size Central City         High Percent Minority          9 
Large/Small Town                              High Percent Minority          9 
Large/Small Town                              Low Percent Minority           9 
Large/Small Town                              Medium Percent Minority        9 
Rural                                         Low Percent Minority          16 
Rural                                         Medium Percent Minority       14 
Total                                                                      112 

ARIZONA (Small School Cluster Type 2 - Geographic) 
Large Central City                            Low Percent Minority           9 
Large Central City                            Medium Percent Minority        8 
Large Central City                            High Percent Minority          9 
Mid-size Central City                         Low Percent Minority          10 
Mid-size Central City                         Medium Percent Minority       10 
Mid-size Central City                         High Percent Minority          9 
Urban Fringe of Large Central City            Low Percent Minority           6 
Urban Fringe of Large/Mid-size Central City   Medium Percent Minority        7 
Urban Fringe of Large/Mid-size Central City   High Percent Minority          6 
Large/Small Town and Rural                    Low Percent Minority          13 
Large/Small Town and Rural                    Medium Percent Minority       13 
Large/Small Town and Rural                    High Percent Minority 
Total                                                                      110 

ARKANSAS (Small School Cluster Type 2 - Geographic) 
Mid-size Central City + Urban Fringe          Low Percent Minority          10 
Mid-size Central City + Urban Fringe          Medium Percent Minority       10 
Mid-size Central City + Urban Fringe          High Percent Minority         10 
Large/Small Town                              Low Percent Minority          15 
Large/Small Town                              Medium Percent Minority       14 
Large/Small Town                              High Percent Minority         15 
Rural                                         Low Percent Minority          19 
Rural                                         Medium Percent Minority       15 
Rural                                         High Percent Minority         16 
Total                                                                      124 
Table 3-3 (continued) 

Urbanization                                  Minority                  Schools in Strata 

CALIFORNIA (Small School Cluster Type 2 - Geographic) 
Large/Mid-size Central City                   Low Percent Minority          13 
Large/Mid-size Central City                   Medium Percent Minority       12 
Large/Mid-size Central City                   High Percent Minority         13 
Urban Fringe of Large Central City            Low Percent Minority          10 
Urban Fringe of Large Central City            Medium Percent Minority       11 
Urban Fringe of Large Central City            High Percent Minority         11 
Urban Fringe of Mid-size Central City         Low Percent Minority           5 
Urban Fringe of Mid-size Central City         Medium Percent Minority        5 
Urban Fringe of Mid-size Central City         High Percent Minority          4 
Large/Small Town and Rural                    Low Percent Minority          13 
Large/Small Town and Rural                    Medium Percent Minority        9 
Large/Small Town and Rural                    High Percent Minority          7 
Total                                                                      113 

COLORADO (Small School Cluster Type 2 - Geographic) 
Large/Mid-size Central City                   Low Percent Minority          12 
Large/Mid-size Central City                   Medium Percent Minority       11 
Large/Mid-size Central City                   High Percent Minority         11 
Urban Fringe of Large/Mid-size Central City   Low Percent Minority          14 
Urban Fringe of Large/Mid-size Central City   Medium Percent Minority       14 
Urban Fringe of Large/Mid-size Central City   High Percent Minority         14 
Large/Small Town                              Low Percent Minority           7 
Large/Small Town                              Medium Percent Minority        7 
Large/Small Town                              High Percent Minority          6 
Rural                                         Low Percent Minority          11 
Rural                                         Medium Percent Minority        9 
Rural                                         High Percent Minority 
Total                                                                      127 

CONNECTICUT (Small School Cluster Type 2 - Geographic) 
Large Central City                            Low Black/Low Hispanic         5 
Large Central City                            Low Black/High Hispanic        4 
Large Central City                            High Black/Low Hispanic        4 
Large Central City                            High Black/High Hispanic       4 
Mid-size Central City                         Low Percent Minority           7 
Mid-size Central City                         Medium Percent Minority        7 
Mid-size Central City                         High Percent Minority          7 
Urban Fringe of Large/Mid-size Central City   None                          34 
Large/Small Town and Rural                    None                          32 
Total                                                                      111 
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Table 3-3 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 4 




Urbanization 


Minority Sc 


bools In Strata 


DELAWARE (Small School Cluster Type 1 - None) 
Large/Mid-size Central City 


Low Percent Minority 


3 


Large/Mid-size Central City 


Medium Percent Minority 


4 


Large/Mid-size Central City 


High Percent Minority 


5 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


1 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


2 


Urban Fringe of Mid-size Central City 


High Percent Minority 


3 


Small Town 


Low Percent Minority 


2 


Small Town 


Medium Percent Minority 


3 


Small Town 


High Percent Minority 


1 


Rural 


Low Percent Minority 


7 


Rural 


Medium Percent Minority 


8 


Rural 


High Percent Minority 


8 

47 


DISTRICT OF COLUMBIA (Small School Cluster Type 1 - 


None) 




Large Central City 


Medium Percent Minority 


44 


Large Central City 


High Percent Minority 


69 

113 


FLORIDA (Small School Cluster Type 2 - Geographic) 
Large Central City 


Low Bladc/Low Hispanic 


4 


Large Central City 


Low Black/High Hispanic 


4 


Large Central City 


High Bladc/Low Hispanic 


4 


Large Central City 


High Black/High Hispanic 


4 


Mid-size Central City 


Low Percent Minority 


6 


Mid-size Central City 


Medium Percent Minority 


7 


Mid-size Central City 


High Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


16 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


16 


Urban Fringe of Large/Mid-size Central City 


Hiqh Percent Minority 


15 


Large/Small Town and Rural 


Low Percent Minority 


8 


Laige/Small Town and Rural 


Medium Percent Minority 


8 


Large/Small Town and Rural 


High Percent Minority 


7 

106 


GEORGIA (Small School Cluster Type 2 - Geographic) 
Large/Mid-size Central City 


Low Percent Minority 


8 


Large/Mid-sizc Central City 


Medium Percent Minority 


8 


Large/Mid-size Central City 


High Percent Minority 


8 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


10 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


11 


Urban Fringe of Laige/Mid-size Central City 


High Percent Minority 


10 


Large/Small Town 


Low Percent Minority 


11 


Large /Small Town 


Medium Percent Minority 


11 


Large/Small Town 


High Percent Minority 


10 


Rural 


Low Percent Minority 


7 


Rural 


Medium Percent Minority 


6 


Rural 


High Percent Minority 


7






107 
Table 3-3 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 4 





Minority 


Schools In Strata 


GUAM (Small School Cluster Type 1 - None)


Rural 


None 


21 


HAWAII (Small School Cluster Type 2 - Geographic)


Mid-size Central City 


None 


33 


Urban Fringe of Mid-size Central City 


None 


51 


Large/Small Town and Rural 

IDAHO (Small School Cluster Type 3 - Stratified)
Large Schools 


None 


-2* 

107 


Mid-size Central City and Urban Fringe 


None 


22 


Large Town 


None 


19 


Small Town 


None 


35 


Rural 

Small Schools 


None 


39 


None 


None 


129 


INDIANA (Small School Cluster Type 2 - Geographic)


Large/Mid-size Central City


Low Percent Minority 


12 


Large/Mid-size Central City


Medium Percent Minority 


11 


Large/Mid-size Central City 


High Percent Minority 


10 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority


13 


Urban Fringe of Large/Mid-size Central City


Medium Percent Minority 


11 


Rural 


None 


26 


Large/Small Town 

IOWA (Small School Cluster Type 3 - Stratified) 
Large Schools 


None 


116 


Mid-size Central City and Urban Fringe 


None 


38 


Large/Small Town 


None 


40 


Rural 

Small Schools 


None 


47 


None 


None 


_i4 

139 


KENTUCKY (Small School Cluster Type 2 - Geographic)


Mid-size Central City 


Low Percent Minority 


6 


Mid-size Central City 


Medium Percent Minority 


7 


Mid-size Central City 


High Percent Minority 


6 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


9 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


8 


Rural 


None 


SI 


Large/Small Town 


None 


123 
Table 3-3 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 4 



Urbanization


Minority 


Schools In Strata 


LOUISIANA (Small School Cluster Type 2 - Geographic)


Large/Mid-size Central City


Low Percent Minority


11 


Large/Mid-size Central City


Medium Percent Minority 


11 


Large/Mid-size Central City 


High Percent Minority 


12 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


r 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


1 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


6 


Large/Small Town 


Low Percent Minority 


11 


Large/Small Town 


Medium Percent Minority 


11 


Large/Small Town 


High Percent Minority 


11 


Rural 


Low Percent Minority 


8 


Rural 


Medium Percent Minority 


11 


Rural 

MAINE (Small School Cluster Type 3 - Stratified)
Large Schools 


High Percent Minority 


9 

114 


Mid-size Central City and Urban Fringe 


None 


21 


Small Town 


None 


58 


Rural 

Small Schools 


None 


45 


None 


None 


J& 

163 


MARYLAND (Small School Cluster Type 2 - Geographic)


Large/Mid-size Central City 


Low Percent Minority 


7 


Large/Mid-size Central City 


Medium Percent Minority 


6 


Large/Mid-size Central City 


High Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


21 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


22 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


21 


Large/Small Town and Rural 


Low Percent Minority 


14 


Large/Small Town and Rural


Medium Percent Minority 


12

110 


MASSACHUSETTS (Small School Cluster Type 2 - Geographic)


Large/Mid-size Central City 


Low Percent Minority 


14 


Large/Mid-size Central City 


Medium Percent Minority 


13 


Large/Mid-size Central City


High Percent Minority 


13 


Urban Fringe of Large/Mid-size Central City 


None 


40 


Large/Small Town and Rural 


None 


120 


MICHIGAN (Small School Cluster Type 2 - Geographic)


Large/Mid-size Central City 


Low Percent Minority 


9 


Large/Mid-size Central City 


Medium Percent Minority 


8 


Large/Mid-size Central City


High Percent Minority 


9 


Urban Fringe of Large/Mid-size Central City 


None 


38 


Rural 


None 


20 


Large/Small Town


None 


—10 

114 
Table 3-3 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 4 



Urbanization Minority Schools in Strata



MINNESOTA (Small School Cluster Type 2 - Geographic)



Large Central City 


Medium Percent Minority 


5 


Mid-size Central City 


Low Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City


None 


36 


Rural 


None 


41 


Large/Small Town


None 


115 


MISSISSIPPI (Small School Cluster Type 2 - Geographic)


Mid-size Central City 


Low Percent Minority 


4 


Mid-size Central City 


Medium Percent Minority 


5 


Mid-size Central City 


High Percent Minority 


4 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


3 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


3 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


3 


Large/Small Town 


Low Percent Minority 


15 


Large/Small Town 


Medium Percent Minority 


15 


Large/Small Town 


High Percent Minority 


15 


Rural 


Low Percent Minority 


13 


Rural 


Medium Percent Minority 


13 


Rural 


High Percent Minority 


_1S 

108 



MISSOURI (Small School Cluster Type 3 - Stratified)
Large Schools 



Large/Mid-size Central City 


Low Percent Minority 


5 


Large/Mid-size Central City 


Medium Percent Minority 


6 


Large/Mid-size Central City 


High Percent Minority


3 


Urban Fringe of Large Central City 


High Percent Minority 


13 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


13 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


13 


Large/Small Town 


None 


24 


Rural 

Small Schools 


None 


33 


None 

NEBRASKA (Small School Cluster Type 3 - Stratified)
Large Schools 


None 


-12 

123 


Mid-size Central City and Urban Fringe 


None 


43 


Large/Small Town 


None 


37 


Rural 

Small Schools 


None 


41 


None 


None 


187 
Table 3-3 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 4 



Urbanization

NEW HAMPSHIRE (Small School Cluster Type 3 - Stratified) 
Large Schools Mid-size Central City and 


Minority 


Schools in Strata


Urban Fringe 


None 


26 


Large/Small Town 


None 


57


Rural 


None 


29 


Small Schools None 


None 


24 

136 


NEW JERSEY (Small School Cluster Type 2 - Geographic) 


Large/Mid-size Central City


Low Black/Low Hispanic 


6 


Large/Mid-size Central City 


Low Black/High Hispanic


5


Large/Mid-size Central City


High Black/Low Hispanic 


5 


Large/Mid-size Central City 


High Black/High Hispanic 


5 


Urban Fringe of Large Central City 


Low Percent Minority 


28 


Urban Fringe of Large Central City 


Medium Percent Minority 


17 


Urban Fringe of Mid-size Central City 


None 


25 


Large/Small Town and Rural 

NEW MEXICO (Small School Cluster Type 3 - Stratified) 
Large Schools 


None 


28

119 


Mid-size Central City and Urban Fringe 


Low Percent Minority 


14 


Mid-size Central City and Urban Fringe 


Medium Percent Minority 


14 


Mid-size Central City and Urban Fringe 


High Percent Minority 


14 


Large Town 


Low Percent Minority 


5 


Large Town 


Medium Percent Minority 


5 


Large Town 


High Percent Minority 


6 


Small Town 


Low Percent Minority


10 


Small Town 


Medium Percent Minority


10 


Small Town 


High Percent Minority 


11 


Rural 


Low Percent Minority 


5 


Rural 


Medium Percent Minority 


7 


Rural 

Small Schools 


High Percent Minority 


£ 


None 


None 


120 


NEW YORK (Small School Cluster Type 2 - Geographic) 


Large/Mid-size Central City


High Black/High Hispanic


11 


Large/Mid-size Central City


Low Black/Low Hispanic 


12 


Large/Mid-size Central City


Low Black/High Hispanic 


12 


Large/Mid-size Central City


High Black/Low Hispanic 


12 


Urban Fringe of Large Central City 


None 


13 


Urban Fringe of Mid-size Central City 


None 


18 


Large/Small Town and Rural 


None 


_22 

110 








Table 3-3 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 4 





Minority 


Schools in Strata


NORTH CAROLINA (Small School Cluster Type 2 - Geographic) 


Mid-size Central City 


Low Percent Minority 


10 


Mid-size Central City 


Medium Percent Minority


9 


Mid-size Central City 


High Percent Minority


10 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


4 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


5


Urban Fringe of Mid-size Central City 


High Percent Minority 


4 


Large/Small Town


Low Percent Minority 


11 


Large/Small Town 


Medium Percent Minority 


11 


Large/Small Town 


High Percent Minority 


10 


Rural 


Low Percent Minority 


17 


Rural 


Medium Percent Minority 


12 


Rural 

NORTH DAKOTA (Small School Cluster Type 3 - Stratified) 
Large Schools 


High Percent Minority 


12

115 


Mid-size Central City and Urban Fringe 


None 


36 


Large/Small Town 


None 


31 


Rural 

Small Schools 


None 


51 


None 


None 


42

160 


OHIO (Small School Cluster Type 2 - Geographic) 


Large/Mid-size Central City 


Low Percent Minority 


11 


Large/Mid-size Central City


Medium Percent Minority 


10 


Large/Mid-size Central City 


High Percent Minority 


11 


Urban Fringe of Large/Mid-size Central City 


None 


32 


Large/Small Town 


None 


24 


Rural 

OKLAHOMA (Small School Cluster Type 3 - Stratified) 
Large Schools 


None 


je 

117 


Large/Mid-size Central City 


Low Percent Minority 


16 


Large/Mid-size Central City 


Medium Percent Minority 


17 


Urban Fringe of Large/Mid-size Central City


None 


14 


Large/Small Town 


None 


37 


Rural 


None 


34 


Small Schools None 


None 


J3 

141 







Table 3-3 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 4 



Urbanization


Minority 


Schools in Strata


PENNSYLVANIA (Small School Cluster Type 2 - Geographic) 


Large/Mid-size Central City


Low Percent Minority 


9 


Large/Mid-size Central City 


Medium Percent Minority 


9 


Large/Mid-size Central City


High Percent Minority 


8 


Urban Fringe of Large/Mid-size Central City 


None 


33 


Large/Small Town 


None 


36 


Rural 


None 


J& 

118 



RHODE ISLAND (Small School Cluster Type 2 - Geographic) 



Large Central City 


Low Percent Minority 


8 


Large Central City 


Medium Percent Minority 


6 


Large Central City 


High Percent Minority 


5 


Mid-size Central City 


None 


9 


Urban Fringe of Large/Mid-size Central City


None 


55 


Large/Small Town and Rural 


None 


110 



SOUTH CAROLINA (Small School Cluster Type 2 - Geographic) 



Mid-size Central City 


Low Percent Minority 


6 


Mid-size Central City 


Medium Percent Minority 


5 


Mid-size Central City 


High Percent Minority 


6 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


10 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


10 


Urban Fringe of Mid-size Central City 


High Percent Minority 


10 


Small Town 


Low Percent Minority 


13 


Small Town 


Medium Percent Minority 


12 


Small Town 


High Percent Minority 


12 


Rural 


Low Percent Minority 


9 


Rural 


Medium Percent Minority 


9 


Rural 


High Percent Minority 


—2 






111 



TENNESSEE (Small School Cluster Type 2 - Geographic)



Large/Mid-size Central City


Low Percent Minority 


13 


Large/Mid-size Central City


Medium Percent Minority 


13 


Large/Mid-size Central City 


High Percent Minority 


12 


Urban Fringe of Large/Mid-size Central City 


None 


19 


Large/Small Town 


None 


31 


Rural 


None 


-J2 

120 







Table 3-3 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 4 



Urbanization 



Minority Schools in Strata 



TEXAS (Small School Cluster Type 2 - Geographic) 
Large Central City 


Low Hispanic/Low Black 


7 


Large Central City 


Low Hispanic/High Black


6 


Large Central City 


High Hispanic/Low Black 


7 


Large Central City 


High Hispanic/High Black


6 


Mid-size Central City 


Low Percent Minority 


0’ 

t 


Mid-size Central City 


Medium Percent Minority 


8 


Mid-size Central City 


High Percent Minority 


9 


Urban Fringe of Large /Mid-size Central City 


Low Percent Minority 


7 


Urban Fringe of Large /Mid-size Central City 


Medium Percent Minority 


5 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


5 


Large/Small Town and Rural 


Low Percent Minority 


15 


Large/Small Town and Rural 


Medium Percent Minority 


14 


Large/Small Town and Rural 


High Percent Minority 


15 






111 


UTAH (Small School Cluster Type 2 - Geographic) 
Mid-soe Central City 


None 


26 


Urban Fringe of Mid-size Central City 


None 


46 


Large/Smail Town 


None 


15 


Rural 


None 








111 


VIRGINIA (Small School Cluster Type 2 - Geographic) 
Mid-size Central City 


Low Percent Minority 


13 


Mid-size Central City 


Medium Percent Minority 


11 


Mid-size Central City 


High Percent Minority 


12 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


11 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


10 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


9 


Small Town 


Medium Percent Minority 


5 


Small Town 


High Percent Minority 


6 


Large/Small Town 


Low Percent Minority 


5 


Rural 


Low Percent Minority 


9 


Rural 


Medium Percent Minority 


11 


Rural 


High Percent Minority 


8






110 


VIRGIN ISLANDS (Small School Cluster Type 1 - None)


Low Percent Minority 


10 




Medium Percent Minority 


6 




High Percent Minority 


24 



Table 3-3 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 4 



Urbanization Minority Schools in Strata



WEST VIRGINIA (Small School Cluster Type 3 - Stratified) 
Large Schools 



Mid-size Central City 


None 


15


Urban Fringe of Mid-size Central City 


None 


19 


Large/Small Town 


None 


35 


Rural 


None 


65 


Small Schools 






None 


None 


-12 






156 



WISCONSIN (Small School Cluster Type 3 - Stratified) 
Large Schools 



Large/Mid-size Central City 


Low Percent Minority 


17 


Large/Mid-size Central City


Medium Percent Minority 


17 


Urban Fringe of Large/Mid-size Central City


None 


20 


Large/Small Town 


None 


31 


Rural 


None 


32 


Small Schools 






None 


None 


12 






129 



WYOMING (Small School Cluster Type 3 - Stratified) 
Large Schools 



Mid-size Central City 


None 


18 


Urban Fringe of Mid-size Central City 


None 


15 


Large/Small Town 


None 


62 


Rural 


None 


25 


Small Schools 






None 


None 


60 






180 








Table 3-4 

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization 



Minority Schools in Strata 



ALABAMA (Small School Cluster Type 2 - Geographic) 



Mid-size Central City 


Low Percent Minority 


7 


Mid-size Central City 


Medium Percent Minority 


6 


Mid-size Central City 


High Percent Minority 


8 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


10 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


9 


Urban Fringe of Mid-size Central City 


High Percent Minority 


8 


Small Town 


Low Percent Minority 


10 


Small Town 


Medium Percent Minority 


9 


Large/Small Town 


High Percent Minority 


9 


Rural 


Low Percent Minority 


13 


Rural 


Medium Percent Minority 


8 


Rural 


High Percent Minority 


-2 

106 


ARIZONA (Small School Cluster Type 2 - Geographic) 
Large Central City 


Low Percent Minority 


8 


Large Central City 


Medium Percent Minority 


8 


Large Central City 


High Percent Minority 


8 


Mid-size Central City 


Low Percent Minority 


9 


Mid-size Central City 


Medium Percent Minority 


10 


Mid-size Central City 


High Percent Minority 


9 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


5 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


6 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


5 


Large/Small Town and Rural 


Low Percent Minority 


13 


Large/Small Town and Rural 


Medium Percent Minority 


10 


Large/Small Town and Rural 


High Percent Minority 


12 

103 


ARKANSAS (Small School Cluster Type 2 - Geographic)
Mid-size Central City and Urban Fringe 


Low Percent Minority 


9 


Mid-size Central City and Urban Fringe 


Medium Percent Minority 


7 


Mid-size Central City and Urban Fringe 


High Percent Minority 


8 


Large/Small Town


Low Percent Minority 


13 


Large/Small Town 


Medium Percent Minority 


15 


Large/Small Town


High Percent Minority 


14 


Rural 


Low Percent Minority 


14 


Rural 


Medium Percent Minority 


9 


Rural 


High Percent Minority 


_u 






100 







Table 3-4 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 8



Urbanization




Schools in Strata


CALIFORNIA (Small School Cluster Type 2 - Geographic)
Large Central City 


Low Percent Minority 


7 


Large Central City 


Medium Percent Minority 


7 


Large Central City 


High Percent Minority 


8 


Mid-size Central City 


Low Percent Minority


5 


Mid-size Central City 


Medium Percent Minority 


5 


Mid-size Central City 


High Percent Minority 


5 


Urban Fringe of Large/Mid-size Central City


Low Percent Minority 


16


Urban Fringe of Large/Mid-size Central City


Medium Percent Minority 


15 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


16 


Large/Small Town and Rural 


Low Percent Minority 


6 


Large/Small Town and Rural 


Medium Percent Minority 


7 


Large/Small Town and Rural 


High Percent Minority


6 

103 


COLORADO (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City 


Low Percent Minority 


11 


Large/Mid-size Central City 


Medium Percent Minority 


12 


Large/Mid-size Central City 


High Percent Minority 


11 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


12 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


12 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


13 


Large/Small Town 


Low Percent Minority


7 


Large/Small Town 


Medium Percent Minority 


7 


Large/Small Town 


High Percent Minority 


7 


Rural 


Low Percent Minority 


7 


Rural 


Medium Percent Minority 


6 


Rural 


High Percent Minority 


—2 

112 


CONNECTICUT (Small School Cluster Type 2 - Geographic)
Large Central City 


Low Black/Low Hispanic


3 


Large Central City 


Low Black/High Hispanic 


1 


Large Central City 


High Black/Low Hispanic 


2 


Large Central City 


High Black/High Hispanic 


4 


Mid-size Central City 


Low Percent Minority 


5 


Mid-size Central City 


Medium Percent Minority 


6 


Mid-size Central City 


High Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


None 


30 


Large/Small Town and Rural 


None 


40 

98 







Table 3-4 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 8 




DELAWARE (Small School Cluster Type 1 - None) 



Mid-size Central City 


Low Percent Minority 


1 


Mid-size Central City 


High Percent Minority 


1 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


2 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


4 


Urban Fringe of Mid-size Central City 


High Percent Minority 


3 


Small Town 


Low Percent Minority 


2 


Small Town 


Medium Percent Minority 


2 


Small Town 


High Percent Minority 


1 


Rural 


Low Percent Minority 


3 


Rural 


Medium Percent Minority 


3 


Rural 


High Percent Minority 


4

26 


DISTRICT OF COLUMBIA (Small School Cluster Type 1 - None)




Large Central City 


Medium Percent Minority 


14 


Large Central City 


High Percent Minority 


_I2 

33 


FLORIDA (Small School Cluster Type 2 - Geographic)
Large Central City 


Low Black/Low Hispanic


2 


Large Central City 


Low Black/High Hispanic


5 


Large Central City 


High Black/Low Hispanic


3 


Large Central City 


High Black/High Hispanic


4 


Mid-size Central City 


Low Percent Minority 


7 


Mid-size Central City 


Medium Percent Minority 


7 


Mid-size Central City 


High Percent Minority 


6 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


16 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


15 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


14 


Large/Small Town and Rural 


Low Percent Minority 


7 


Large/Small Town and Rural 


Medium Percent Minority 


8 


Large/Small Town and Rural 


High Percent Minority 


101 


GEORGIA (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City 


Low Percent Minority 


6 


Large/Mid-size Central City 


Medium Percent Minority 


7 


Large/Mid-size Central City 


High Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


11 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


10 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


10 


Large/Small Town 


Low Percent Minority 


10 


Large/Small Town 


Medium Percent Minority 


12 


Large/Small Town 


High Percent Minority 


11 


Rural 


Low Percent Minority 


5 


Rural 


Medium Percent Minority 


5 


Rural 


High Percent Minority 








100 
Table 3-4 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization Minority Schools In Strata 

GUAM (Small School Cluster Type 1 - None) 



Rural 


None 


6 


HAWAII (Small School Cluster Type 1 - None) 
Mid-size Central City 


None 


12 


Urban Fringe of Mid-size Central City 


None 


22 


Large/Small Town and Rural


None 


52 


IDAHO (Small School Cluster Type 2 - Geographic) 
Mid-size Central City and Urban Fringe 


None 


10 


Large Town 


None 


10 


Small Town 


None 


26 


Rural


None 


35 

81 


INDIANA (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City 


Low Percent Minority 


9 


Large/Mid-size Central City 


Medium Percent Minority 


10 


Large/Mid-size Central City 


High Percent Minority 


8 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


10 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


7 


Large/Small Town 


None 


34 


Rural 


None 


J£2 






105 


IOWA (Small School Cluster Type 2 - Geographic) 
Mid-size Central City and Urban Fringe 


None 


32 


Large/Small Town 


None 


38 


Rural 


None 


-25 






105 


KENTUCKY (Small School Cluster Type 2 - Geographic) 
Mid-size Central City 


Low Percent Minority 


6 


Mid-size Central City 


Medium Percent Minority 


6 


Mid-size Central City 


High Percent Minority 


5 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


6 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


4 


Urban Fringe of Mid-size Central City 


High Percent Minority 


5 


Rural 


None 


31 


Large/Small Town 


None 


41






104 



Table 3-4 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 8



Urbanization


Minority 


Schools In Strata 


LOUISIANA (Small School Cluster Type 2 - Geographic) 
Large /Mid-size Central City 


Low Percent Minority 


9 


Large/Mid-size Central City 


Medium Percent Minority 


10 


Large/Mid-size Central City 


High Percent Minority 


11


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


8 


Large/Small Town 


Low Percent Minority 


10 


Large/Small Town 


Medium Percent Minority 


11 


Large/Small Town


High Percent Minority 


10 


Rural 


Low Percent Minority 


6 


Rural 


Medium Percent Minority 


6 


Rural 


High Percent Minority 


6

101 



MAINE (Small School Cluster Type 3 - Stratified) 
Large Schools 



Mid-size Central City and Urban Fringe 


None 


15


Small Town 


None 


50


Rural 


None 


25 


Small Schools 
None 


None 








100 


MARYLAND (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City 


Low Percent Minority 


6 


Large/Mid-size Central City 


Medium Percent Minority 


7 


Large/Mid-size Central City 


High Percent Minority 


6 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


20 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


21 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


18 


Large/Small Town and Rural 


Low Percent Minority 


10 


Large/Small Town and Rural 


Medium Percent Minority 


8 


Large/Small Town and Rural 


High Percent Minority 


_2 






103 


MASSACHUSETTS (Small School Cluster Type 2 - Geographic) 
Large/Mid-size Central City 


Low Percent Minority 


11 


Large/Mid-size Central City 


Medium Percent Minority 


9 


Large/Mid-size Central City 


High Percent Minority 


10 


Urban Fringe of Large/Mid-size Central City 


None 


29 


Large/Small Town 


None 


98 







Table 3-4 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization



Minority Schools in Strata



MICHIGAN (Small School Cluster Type 2 - Geographic) 



Large/Mid-size Central City 


Low Percent Minority 


7 


Large/Mid-size Central City 


Medium Percent Minority 


8 


Large/Mid-size Central City 


High Percent Minority 


8 


Urban Fringe of Large/Mid-size Central City 


None 


36 


Large/Small Town 


None 


30 


Rural 


None 


_J6 






105 


MINNESOTA (Small School Cluster Type 2 - Geographic) 
Large Central City 


Medium Percent Minority 


6 


Mid-size Central City 


Low Percent Minority 


7 


Urban Fringe of Large/Mid-size Central City 


None 


32 


Large/Small Town 


None 


27 


Rural 


None 


29 






101 


MISSISSIPPI (Small School Cluster Type 2 - Geographic) 
Mid-size Central City 


Low Percent Minority 


4 


Mid-size Central City 


Medium Percent Minority 


3 


Mid-size Central City 


High Percent Minority 


4 


Urban Fringe of Mid-size Central City


Low Percent Minority 


3 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


3 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


4 


Large/Small Town 


Low Percent Minority 


16 


Large/Small Town 


Medium Percent Minority 


15 


Large/Small Town 


High Percent Minority 


14 


Rural 


Low Percent Minority 


11 


Rural 


Medium Percent Minority 


12 


Rural 


High Percent Minority 


_1Q 

99 


MISSOURI (Small School Cluster Type 3 - Stratified)
Large Schools
Large/Mid-size Central City 


Low Percent Minority 


4 


Large/Mid-size Central City 


Medium Percent Minority 


4 


Large/Mid-size Central City 


High Percent Minority 


4 


Urban Fringe of Large/Mid-size Central City 


Low Percent Minority 


11 


Urban Fringe of Large/Mid-size Central City 


Medium Percent Minority 


12 


Urban Fringe of Large/Mid-size Central City 


High Percent Minority 


11 


Large/Small Town 


None 


28 


Rural 


None 


26 


Small Schools 
None 


None 








106 







Table 3-4 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization Minority Schools in Strata



NEBRASKA (Small School Cluster Type 3 - Stratified) 

Large Schools 



Mid-size Central City and Urban Fringe 


None 


24 


Large/Small Town


None 


25 


Rural 


None 


30 


Small Schools 
None 


None 


32 






111 


NEW HAMPSHIRE (Small School Cluster Type 2 - Geographic) 
Mid-size Central City and Urban Fringe 


None 


13 


Large/Small Town 


None 


45 


Rural 


None 


77 


NEW JERSEY (Small School Cluster Type 2 - Geographic) 
Large/Mid-size Central City


Low Black/Low Hispanic


4 


Large/Mid-size Central City


Low Black/High Hispanic 


5 


Large/Mid-size Central City 


High Black/Low Hispanic 


5 


Large/Mid-size Central City


High Black/High Hispanic


5 


Urban Fringe of Large Central City 


Low Percent Minority 


24 


Urban Fringe of Large Central City 


Medium Percent Minority 


14 


Urban Fringe of Mid-size Central City 


None 


24 


Large/Small Town and Rural 


None 


24 






105 


NEW MEXICO (Small School Cluster Type 2 - Geographic) 
Mid-size Central City and Urban Fringe 


Low Percent Minority 


10 


Mid-size Central City and Urban Fringe 


Medium Percent Minority 


9 


Mid-size Central City and Urban Fringe 


High Percent Minority 


10 


Large Town 


Low Percent Minority 


4 


Large Town 


Medium Percent Minority 


5 


Large Town 


High Percent Minority 


5 


Small Town 


Low Percent Minority 


9 


Small Town 


Medium Percent Minority 


11 


Small Town 


High Percent Minority 


9 


Rural 


Low Percent Minority 


5 


Rural 


Medium Percent Minority 


9 


Rural 


High Percent Minority 





92 







Table 3-4 (continued)

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization 


Minority


Schools in Strata 


NEW YORK (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City 


Low Black/High Hispanic 


10 


Large/Mid-size Central City 


Low Black/Low Hispanic


11 


Large/Mid-size Central City 


High Black/Low Hispanic 


10 


Large/Mid-size Central City 


High Black/High Hispanic


11 


Urban Fringe of Large Central City 


Low Percent Minority 


8 


Urban Fringe of Large Central City 


Medium Percent Minority 


7 


Urban Fringe of Mid-size Central City 


None 


18 


Large/Small Town and Rural 


None 


29 






104 


NORTH CAROLINA (Small School Cluster Type 2 - Geographic)
Mid-size Central City 


Low Percent Minority 


8 


Mid-size Central City 


Medium Percent Minority 


8 


Mid-size Central City 


High Percent Minority 


9 


Urban Fringe of Mid-size Central City 


Low Percent Minority 


5 


Urban Fringe of Mid-size Central City 


Medium Percent Minority 


4 


Urban Fringe of Mid-size Central City 


High Percent Minority 


5 


Large/Small Town 


Low Percent Minority 


11 


Large/Small Town 


Medium Percent Minority 


12 


Large/Small Town 


High Percent Minority 


10 


Rural 


Low Percent Minority 


11 


Rural 


Medium Percent Minority 


10 


Rural 


High Percent Minority 


JU 






103 


NORTH DAKOTA (Small School Cluster Type 3 - Stratified)
Large Schools 

Mid-size Central City and Urban Fringe 


None 


10 


Large/Small Town 


None 


13 


Rural 


None 


31 


Small Schools 
None 


None 


Ji 

73 


OHIO (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City                   Low Percent Minority         9
Large/Mid-size Central City                   Medium Percent Minority      9
Large/Mid-size Central City                   High Percent Minority        9
Urban Fringe of Large/Mid-size Central City   None                        34
Large/Small Town                              None                        23
Rural                                         None                        21
Total                                                                    105







Table 3-4 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization                                  Minority                   Schools in Strata



OKLAHOMA (Small School Cluster Type 3 - Stratified)
Large Schools
Large/Mid-size Central City                   Low Percent Minority        10
Large/Mid-size Central City                   Medium Percent Minority      8
Large/Mid-size Central City                   High Percent Minority        7
Urban Fringe of Large/Mid-size Central City   None                        12
Large/Small Town                              None                        35
Rural                                         None                        25
Small Schools
None                                          None                        10
Total                                                                    107


PENNSYLVANIA (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City                   Low Percent Minority         6
Large/Mid-size Central City                   Medium Percent Minority      6
Large/Mid-size Central City                   High Percent Minority        6
Urban Fringe of Large/Mid-size Central City   None                        33
Large/Small Town                              None                        35
Rural                                         None                        16
Total                                                                    102


RHODE ISLAND (Small School Cluster Type 1 - None)
Large Central City                            Low Percent Minority         2
Large Central City                            Medium Percent Minority      3
Large Central City                            High Percent Minority        1
Mid-size Central City                         None                         4
Urban Fringe of Large/Mid-size Central City   None                        27
Large/Small Town and Rural                    None                       Jtt
Total                                                                     AO


SOUTH CAROLINA (Small School Cluster Type 2 - Geographic)
Mid-size Central City                         Low Percent Minority         7
Mid-size Central City                         Medium Percent Minority      5
Mid-size Central City                         High Percent Minority        5
Urban Fringe of Mid-size Central City         Low Percent Minority         9
Urban Fringe of Mid-size Central City         Medium Percent Minority     10
Urban Fringe of Mid-size Central City         High Percent Minority       10
Small Town                                    Low Percent Minority        13
Small Town                                    Medium Percent Minority     12
Small Town                                    High Percent Minority       14
Rural                                         Low Percent Minority         6
Rural                                         Medium Percent Minority      7
Rural                                         High Percent Minority        7
Total                                                                    105
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Table 3-4 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization                                  Minority                   Schools in Strata



TENNESSEE (Small School Cluster Type 2 - Geographic)
Large/Mid-size Central City                   Low Percent Minority        10
Large/Mid-size Central City                   Medium Percent Minority     10
Large/Mid-size Central City                   High Percent Minority       11
Urban Fringe of Large/Mid-size Central City   None                        18
Large/Small Town                              None                        31
Rural                                         None                        24
Total                                                                    104


TEXAS (Small School Cluster Type 2 - Geographic)
Large Central City                            Low Hispanic/Low Black       6
Large Central City                            Low Hispanic/High Black      5
Large Central City                            High Hispanic/Low Black      6
Large Central City                            High Hispanic/High Black     6
Mid-size Central City                         Low Percent Minority         8
Mid-size Central City                         Medium Percent Minority      9
Mid-size Central City                         High Percent Minority        8
Urban Fringe of Large/Mid-size Central City   Low Percent Minority         6
Urban Fringe of Large/Mid-size Central City   Medium Percent Minority      7
Urban Fringe of Large/Mid-size Central City   High Percent Minority        6
Large/Small Town and Rural                    Low Percent Minority        12
Large/Small Town and Rural                    Medium Percent Minority     13
Large/Small Town and Rural                    High Percent Minority       12
Total                                                                    104


UTAH (Small School Cluster Type 2 - Geographic) 
Mid-size Central City 


None 


21 


Urban Fringe of Mid-size Central City 


None 


37 


Rural 


None 


12 


Large/Small Town 


None 


85 


VIRGINIA (Small School Cluster Type 2 - Geographic)
Mid-size Central City                         Low Percent Minority        11
Mid-size Central City                         Medium Percent Minority     10
Mid-size Central City                         High Percent Minority        8
Urban Fringe of Large/Mid-size Central City   Low Percent Minority        11
Urban Fringe of Large/Mid-size Central City   Medium Percent Minority     12
Urban Fringe of Large/Mid-size Central City   High Percent Minority       12
Small Town                                    Low Percent Minority         5
Small Town                                    Medium Percent Minority      6
Small Town                                    High Percent Minority        6
Rural                                         Low Percent Minority         8
Rural                                         Medium Percent Minority      9
Rural                                         High Percent Minority        7
Total                                                                    105
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Table 3-4 (continued) 

Distribution of the Selected Schools by Sampling Strata, Grade 8 



Urbanization                                  Minority                   Schools in Strata


VIRGIN ISLANDS (Small School Cluster Type 1 - None)
                                              Low Percent Minority         3
                                              Medium Percent Minority      2
                                              High Percent Minority        6


WEST VIRGINIA (Small School Cluster Type 2 - Geographic) 


Mid-size Central City 


None 


14 


Urban Fringe of Mid-size Central City 


None 


12 


Rural 


None 


44 


Large /Small Town 


None 


103 


WISCONSIN (Small School Cluster Type 2 - Geographic)
Mid-size Central City                         Low Percent Minority        15
Large/Mid-size Central City                   Medium Percent Minority     12
Urban Fringe of Large/Mid-size Central City   None                        17
Large/Small Town                              None                        33
Rural                                         None                        28
Total                                                                    105

WYOMING (Small School Cluster Type 3 - Stratified)
Large Schools
Mid-size Central City                         None                         4
Urban Fringe of Mid-size Central City         Low Percent Minority         1
Urban Fringe of Mid-size Central City         Medium Percent Minority      1
Urban Fringe of Mid-size Central City         High Percent Minority        1
Small Town                                    None                        26
Rural                                         None                        16
Small Schools
None                                          None                         6
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Case 2: Urbanization strata with more than 10 percent Black students or 7 percent 
Hispanic students, but not more than 20 percent of both, were stratified by ordering percent 
minority enrollment within the urbanization classes and dividing the schools into three groups 
with about equal numbers of students per minority group. 

Case 3: In urbanization strata with more than 20 percent of both Black and Hispanic 
students, minority strata were formed as follows. The minority group with the higher percentage 
gave the primary stratification variable; the remaining group gave the secondary stratification 
variable. Within urbanization class, the schools were sorted based on the primary stratification 
variable and divided into two groups of schools containing approximately equal numbers of 
students. Within each of these two groups, the schools were sorted by the secondary 
stratification variable and subdivided into two subgroups of schools containing approximately 
equal numbers of students. As a result, within urbanization strata there were four minority 
groups: low Black/low Hispanic, low Black/high Hispanic, high Black/low Hispanic, and high 
Black/high Hispanic. 

Tables 3-3 and 3-4 provide information on minority stratification for the participating 
states, respectively for fourth and eighth grade. 



3.4.4 Median Household Income 

The median household income variable was not used as a prime stratification variable 
because the available income data were not up to date (i.e., they were based on the 1980 census). 
Instead, median household income was used as a sorting variable at the final stage of 
stratification. Prior to the selection of the school samples, the schools were sorted by 
urbanization, then by minority classes within urbanization in a serpentine order, in which the 
sort alternated between descending and ascending order within each group. This meant that 
adjacent schools on the list were generally similar with regard to either urbanization or minority 
enrollment, and often to both. Within minority class, the schools were sorted, in serpentine 
order, by the median household income. This final stage of sorting resulted in implicit 
stratification of median income. The data on median household income, which were obtained 
from Donnelly Marketing Information Services, were related to the ZIP code area in which the 
school was located. The data are derived from the 1980 census, but expressed in 1985 dollars. 
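
The serpentine ordering described above can be sketched in a few lines of Python. This is only an illustration, not the production procedure: the record layout, the field names (`urbanization`, `minority_class`, `income`), and the use of integer codes for the strata are assumptions made for the example.

```python
from itertools import groupby


def serpentine_sort(records, keys):
    """Sort records hierarchically by `keys`, reversing the sort direction
    each time a new group is entered at a given level, so that neighbouring
    records on the final list tend to be similar."""
    ascending = [True] * len(keys)  # current direction for each key level

    def sort_level(recs, level):
        if level == len(keys):
            return list(recs)
        key = keys[level]
        ordered = sorted(recs, key=lambda r: r[key], reverse=not ascending[level])
        ascending[level] = not ascending[level]  # flip for the next group at this level
        out = []
        for _, group in groupby(ordered, key=lambda r: r[key]):
            out.extend(sort_level(list(group), level + 1))
        return out

    return sort_level(records, 0)


# Hypothetical frame: two urbanization classes, two minority classes each.
schools = [
    {"urbanization": u, "minority_class": m, "income": inc}
    for u, m, inc in [(1, 1, 10), (1, 1, 20), (1, 2, 30), (1, 2, 40),
                      (2, 1, 50), (2, 1, 60), (2, 2, 70), (2, 2, 80)]
]
ordered = serpentine_sort(schools, ["urbanization", "minority_class", "income"])
incomes = [s["income"] for s in ordered]
```

In this toy frame the minority classes run low to high in the first urbanization class and high to low in the second, and the income sort reverses direction from one minority class to the next.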



3.4.5 Schools With Fewer Than 20 Students 

Schools with fewer than 20 students were collapsed with other schools to form a 
sampling unit of at least 20 students. The two methods used to collapse small schools are 
referred to as geographic and stratified grouping. This collapsing was done separately for fourth 
and eighth grade. 

Geographic Grouping. If the number of small schools in the state was less than 20 
percent, and the number of students in these small schools accounted for less than 1 percent of 
the total state grade enrollment, then each school was combined with a school close by 
geographically until the cluster contained at least 20 students. 
The cluster level values for enrollment, urbanization, minority, income variables, and 
selection probabilities were equal to the corresponding values of the school in the cluster with 
largest enrollment. 

Stratified Grouping. In states with a larger number of small schools (Cluster Type 3 
states), schools were stratified into two groups. One group contained schools with fewer than 20 
students, the other group contained schools with 20 or more students. The schools in the first 
group were clustered in the following manner. The schools were ordered from smallest to 
largest, then the largest school was matched with the smallest school. If this cluster contained 20 
or more students, it was complete. If the total cluster enrollment was 19 or smaller, the next 
smallest school was added. This continued until the sum of the enrollment was at least 20. We 
proceeded to form the next cluster with the next largest and smallest school in the same manner. 
If, after forming all the clusters, there remained a cluster with fewer than 20 students, it was 
combined with the previous cluster. 
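
As a rough illustration of the stratified grouping rule, the following Python sketch forms clusters from a list of small-school enrollments. The function name and the list-of-enrollments input are assumptions for the example; the report does not describe how the procedure was actually programmed.

```python
from collections import deque

MIN_CLUSTER_SIZE = 20  # a sampling unit must contain at least 20 students


def cluster_small_schools(enrollments):
    """Group small-school enrollments into clusters of at least 20 students.

    Each cluster starts with the largest remaining school and then takes the
    smallest remaining schools until the cluster reaches 20 students."""
    remaining = deque(sorted(enrollments))  # smallest on the left
    clusters = []
    while remaining:
        cluster = [remaining.pop()]  # largest remaining school
        while remaining and sum(cluster) < MIN_CLUSTER_SIZE:
            cluster.append(remaining.popleft())  # next smallest school
        clusters.append(cluster)
    # A final undersized cluster is combined with the previous one.
    if len(clusters) > 1 and sum(clusters[-1]) < MIN_CLUSTER_SIZE:
        leftover = clusters.pop()
        clusters[-1].extend(leftover)
    return clusters


example = cluster_small_schools([5, 6, 7, 18, 19])
```

With enrollments of 5, 6, 7, 18, and 19, the first two clusters are complete after one match each, and the remaining school of 7 students is folded into the second cluster.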

The enrollment value assigned to a cluster was equal to the sum of enrollments of the 
schools in that cluster. The minority value assigned to the cluster was equal to the weighted 
average of the proportion of minority for schools in the cluster, where the weight was the 
fourth/eighth-grade enrollment. The cluster level income value was the median income of the 
school with the largest enrollment. No urbanization value was derived for clusters of schools. 
Also, no selection probability was derived for clusters of schools since they were selected with 
equal probability. 
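
The cluster-level values for stratified grouping amount to a sum, an enrollment-weighted mean, and a lookup. A minimal sketch follows, assuming each school is a dictionary with hypothetical keys `enrollment`, `pct_minority`, and `median_income`.

```python
def cluster_values(schools):
    """Derive cluster-level enrollment, minority, and income values
    from the schools that make up one small-school cluster."""
    total_enrollment = sum(s["enrollment"] for s in schools)
    # Enrollment-weighted average of the schools' minority proportions.
    minority = sum(s["enrollment"] * s["pct_minority"] for s in schools) / total_enrollment
    # Income is taken from the school with the largest enrollment.
    largest = max(schools, key=lambda s: s["enrollment"])
    return {
        "enrollment": total_enrollment,
        "pct_minority": minority,
        "median_income": largest["median_income"],
    }


values = cluster_values([
    {"enrollment": 12, "pct_minority": 0.50, "median_income": 20000},
    {"enrollment": 8, "pct_minority": 0.25, "median_income": 30000},
])
```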

Tables 3-3 and 3-4 show the type of stratification used for small schools within the 
participating states, for fourth- and eighth-grade samples. 



3.5 SCHOOL SAMPLE SELECTION FOR THE 1992 TRIAL STATE ASSESSMENT 
3.5.1 Control of Overlap of School Samples for National Educational Studies 

The issue of school sample overlap has been relevant in all rounds of NAEP in recent 
years, but no more so than in 1992. NAEP collected data nationally from a number of distinct 
samples at all three age classes, while state assessments were conducted at grades 4 and 8. At 
the same time, the U.S. Department of Education conducted the first phase followup to 
Prospects: The National Longitudinal Study of Chapter I Children (Abt Associates, 1991), for which 
a sample of districts was selected prior to the 1992 Trial State Assessment sample selection. 

This study involved substantial student assessment at grades 4 and 8. 

To avoid undue burden on individual schools, NAEP developed a policy for 1992 of 
avoiding overlap of school samples from different studies for the same age class. This was to be 
achieved without unduly distorting the resulting samples by introducing bias or substantial 
variance. Thus, at grade 8 for example, the school samples for the national samples, the state 
samples, and the Prospects samples were selected to contain different schools, to the extent 
feasible. Besides generally controlling overlap within grade, distinct schools were selected for 
the fourth- and eighth-grade state assessment samples within a state to the extent feasible. The 
procedure used was an extension of the method proposed by Keyfitz (1951). The general 
approach is as follows. 
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Consider as an example the selection of samples for the Trial State Assessment eighth- 
grade sample. At the time of drawing the NAEP samples, the identities of the Prospects sample 
schools were not known. Since the selected districts, and district selection probabilities for all 
districts, were known, this information was used to control sample overlap. For each school in 
the frame for the national and state NAEP samples, there was a flag, C, indicating whether 
(C=1) or not (C=0) the district containing the school was included in the Prospects sample, 
and a Prospects district selection probability, $P_C = P(C=1)$. 

In controlling overlap between NAEP state and national sample school selections, we 
used national school selection probabilities that were conditional on the selection of national 
sample PSUs (i.e., the school-within-PSU selection probabilities). This meant that in selecting 
the state samples, in those states where there was no PSU selection for the national samples no 
adjustments were needed to account for the selection of national NAEP samples (which might 
have selected schools within that state but, in fact, did not). This procedure of conditioning on 
the selection of PSUs also recognizes the impact of the heavy within-PSU sampling in 
noncertainty PSUs in some states, even though the unconditional probabilities of selection for 
such schools in the national samples were quite low. In other words, conditioning on the 
national PSU sample reduces the variance of the state samples, although it leads to a greater 
degree of sample overlap than if unconditional national selection probabilities had been used in 
the procedure for controlling overlap between state and national samples. 

Let $N = 1$ if the school is selected in the national sample; let $N = 0$ otherwise. Let 
$P_N = P(N = 1)$. Thus, $P_N = 0$ for schools not located within a selected national sample PSU. 

Let $x_s$ denote the expected number of times a school is to be selected for the state eighth-grade 
sample. The actual number of times that a school will be selected for the sample with the 
systematic sampling procedure used is equal to $x_s$ if $x_s$ is an integer, or to one of the two 
integers closest to $x_s$ if $x_s$ is not an integer. Large schools within a state may be selected up to 
three times; that is, $x_s$ can be as great as 3 for some schools. The sample size of students to be 
drawn within the school is proportional to the number of times the school is selected. Schools 
to be included with certainty in the state sample ($x_s \ge 1$) are not subject to overlap control, as 
such schools are self-representing in the state sample. Excluding such schools on a random basis 
would add undue variance to the state estimates. 

For $x_s < 1$, $x_s$ denotes a true probability of selection for the school. Where possible, 
schools in districts selected for the Prospects study were excluded provided that the Prospects 
district selection probability, $P_C$, fell below a constant, $k_s$, that varied from state to state. In small 
states, where it is important to include all eligible schools in the state sample, $k_s$ was set to zero. 
The variable C indicates whether (C=1) or not (C=0) the district was included in the Prospects 
sample. For actually drawing the state samples, a conditional expected number of selections, $x_s^*$, 
was derived for each school in the frame as follows: 



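$$
x_s^* =
\begin{cases}
x_s & \text{if } x_s \ge 1 \\[6pt]
\dfrac{x_s}{1 - \phi_N} & \text{if } x_s < 1,\ P_C > k_s, \text{ and } N = 0 \\[10pt]
\dfrac{x_s\,(P_N - \phi_N)}{P_N\,(1 - \phi_N)} & \text{if } x_s < 1,\ P_C > k_s, \text{ and } N = 1 \\[10pt]
\dfrac{x_s}{1 - \phi_{NC}} & \text{if } x_s < 1,\ P_C \le k_s, \text{ and } (C = 0 \text{ and } N = 0) \\[10pt]
\dfrac{x_s\,(P_N + P_C - \phi_{NC})}{(P_N + P_C)\,(1 - \phi_{NC})} & \text{if } x_s < 1,\ P_C \le k_s, \text{ and } (C = 1 \text{ or } N = 1)
\end{cases}
$$

where $\phi_N = \min(P_N,\ 1 - x_s)$ and $\phi_{NC} = \min(P_N + P_C,\ 1 - x_s)$. The values of $x_s^*$ are conditional on 
the selection of districts for the Prospects sample and PSUs for the national NAEP samples. 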



This procedure in general gave state NAEP conditional selection probabilities that are 
smaller than the unconditional selection probabilities for schools located in Prospects selected 
districts, and for schools selected for the national sample. The relative chance of selection in 
the state sample for a school selected in either of these other two samples, compared to its 
chance of selection in the state sample if not selected for either of the other samples, is 
$(P_N - \phi_N)/P_N$ if $P_C > k_s$ and $(P_N + P_C - \phi_{NC})/(P_N + P_C)$ if $P_C \le k_s$. If $P_N$, $P_C$, and $x_s$ are all 
relatively small, then $P_N + P_C - \phi_{NC} = 0$, so that there was no chance of selecting the school for 
the state sample if it is in the national sample or in a Prospects district selection. The expected 
number of times a school was selected in the state sample, conditional on the national PSU 
sample but unconditional on the national school sample selection within PSUs and the selection 
of districts for the Prospects sample, is given by $x_s$, as desired. This follows from the above 
formulation of $x_s^*$ and the fact that $P(C = 1 \text{ or } N = 1)$ equals $P_N + P_C$ when $P_C \le k_s$, since there is 
no overlap of NAEP national sample selected schools and Prospects selected districts in this 
case. The quantity $x_s$ is used as the basis for weighting the schools, and hence students, in the 
state samples. 



To illustrate the use of these expressions in drawing the state sample, suppose that 
$P_C > k_s$ (or $P_C = 0$) so that we are concerned only with controlling overlap with the national 
sample. Suppose that $x_s = 0.3$ and $P_N = 0.25$. Then $\phi_N = P_N = 0.25$, and $x_s^* = 0.4$ if the school 
is not selected for the national sample. Thus in this case the school is selected to conduct a 
single assessment session of about 30 students with probability 0.4. Since $\phi_N = P_N$, $x_s^* = 0$ if the 
school is selected for the national sample. Thus there is no chance that this school will be 
selected for both national and state samples. Integrating over the national sampling process 
gives the required unconditional state selection probability of 0.3 (= 0.4 × (1 − 0.25) + 0 × 0.25). 
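
The overlap-control rule can be checked numerically with a short Python sketch. The printed case formula is partly illegible in this copy, so the form used here is an assumption: the conditional value is $x_s/(1-\phi)$ for a school selected in neither other sample, scaled by the relative chance quoted in the text for a school that was selected. The function and argument names are hypothetical, and the illustration's values are taken as $x_s = 0.3$ and $P_N = 0.25$, the values for which the stated results of 0.4 and 0.3 are consistent.

```python
def conditional_selection(x_s, p_n, p_c, k_s, in_national, in_prospects_district):
    """Conditional expected number of state-sample selections for one school.

    x_s: unconditional expected number of selections in the state sample
    p_n: national within-PSU selection probability, P(N = 1)
    p_c: Prospects district selection probability, P(C = 1)
    k_s: state-specific threshold on p_c
    """
    if x_s >= 1:
        return x_s  # certainty schools are not subject to overlap control
    if p_c > k_s:
        # Only overlap with the national sample is controlled.
        phi = min(p_n, 1 - x_s)
        base = x_s / (1 - phi)
        return base * (p_n - phi) / p_n if in_national else base
    # Overlap with both the national sample and Prospects districts is controlled.
    phi = min(p_n + p_c, 1 - x_s)
    base = x_s / (1 - phi)
    if in_national or in_prospects_district:
        return base * (p_n + p_c - phi) / (p_n + p_c)
    return base


x_s, p_n = 0.3, 0.25
not_selected = conditional_selection(x_s, p_n, p_c=0.5, k_s=0.1,
                                     in_national=False, in_prospects_district=False)
selected = conditional_selection(x_s, p_n, p_c=0.5, k_s=0.1,
                                 in_national=True, in_prospects_district=False)
# Averaging over the national selection recovers the unconditional value.
unconditional = not_selected * (1 - p_n) + selected * p_n
```

Under these assumptions `not_selected` is about 0.4, `selected` is 0, and `unconditional` is about 0.3. The sketch does not guard against the degenerate case in which $\phi$ equals 1.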



3.5.2 Selection of Schools in Small States (Cluster Type 1 States) 

For states with small numbers of schools, and no or very few small schools, all schools 
were included in the sample with certainty. In the fourth grade, all the eligible fourth-grade 
schools in the District of Columbia, Delaware, Guam, and the Virgin Islands were taken into the 
sample with certainty. In the eighth grade, all the eligible schools were taken from the District 
of Columbia, Delaware, Hawaii, Rhode Island, Guam, and the Virgin Islands. 








3.5.3 States with Geographic Clustering of Small Schools (Cluster Type 2 States) 

Clusters were sorted by urbanization, minority strata (which varied by state and 
urbanization level), and median income. A systematic sample of clusters was then selected for 
each state with probability proportionate to size, where size was equal to the estimated grade 
enrollment within the school, so as to achieve the desired student sample size of 3,150 for the 
eighth grade and 6,300 for the fourth grade. 
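
Systematic selection with probability proportionate to size can be sketched as follows. This is a generic textbook version, not the program used for the assessment; the function name, the `n_selections` parameter, and the toy sizes are assumptions. It also shows why a unit whose expected number of selections is a whole number is always selected exactly that many times, while other units receive one of the two nearest integers.

```python
import random


def systematic_pps(sizes, n_selections, rng=None):
    """Return the number of times each unit on the sorted list is selected.

    A random start is drawn in the first sampling interval and selection
    points are then placed at equal intervals along the cumulated sizes."""
    rng = rng or random.Random()
    interval = sum(sizes) / n_selections
    start = rng.random() * interval
    hits = [0] * len(sizes)
    i = 0
    upper = sizes[0]  # cumulated size through unit i
    for k in range(n_selections):
        point = start + k * interval
        while point >= upper and i < len(sizes) - 1:
            i += 1
            upper += sizes[i]
        hits[i] += 1
    return hits


# Toy frame: expected selections are 1/3, 1, 2, and 2/3 (interval = 150).
hits = systematic_pps([50, 150, 300, 100], n_selections=4, rng=random.Random(1))
```

Because the list is sorted before selection, taking points at a fixed interval spreads the sample across the urbanization, minority, and income orderings.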

Up to three sessions (90 students) were selected from each school to more efficiently 
represent the large schools in the eighth-grade sample. The fourth-grade sample selected two 
sessions from larger schools (those with more than 20 students), one for reading and one for 
mathematics assessments. 

Following the selection of clusters, there was some thinning of small schools. The 
purpose of thinning was to give students in small schools (enrollment less than 20) 
approximately the same chance of selection as those from larger schools. In addition, thinning 
of small schools controlled the number of schools in the sample to be close to the desired 
number, and thereby controlled the cost of data collection. All small schools in a cluster were 
retained in the sample with probability $P_s/P_c$, where $P_s$ was the probability of selection of the 
small school and $P_c$ was the probability of selection of the cluster. 

Table 3-5 shows the distribution of selected schools in the participating states. 



3.5.4 States with Stratification of Small Schools (Cluster Type 3 States) 

As described above, clusters were sorted by urbanization, minority strata (which varied 
by state and urbanization level), and median income within the two size clusters. Small school 
clusters were selected systematically with equal probability, and large schools were sampled 
systematically with probability proportionate to size, so as to achieve the desired student sample 
size of 3,150 for the eighth grade and 6,300 for the fourth grade. 

Similar to Cluster Type 2 states, up to three eighth-grade sessions were selected within 
each school to more efficiently represent the larger schools in the sample. For the fourth-grade 
samples, each selected school was chosen for one reading and one mathematics session except 
for schools with fourth-grade enrollment of fewer than 20, which were assigned only a single 
session. 

Table 3-5 shows the distribution of selected schools in the participating states. 



3.5.5 Overlap of School Samples 

As stated in section 3.5.1, the sample design for eighth-grade schools minimized, to the 
extent feasible, the chances of selecting eighth-grade schools that were also in the 1992 national 
NAEP and the Prospects survey. Furthermore, the fourth-grade state samples were selected such 
that the number of schools in each state selected for both fourth- and eighth-grade samples was 
minimized to the extent feasible. 

Table 3-5 

Distribution of Sample Sizes by School Size, with Corresponding Overlap Between Grades 

                         Number of Small* Schools          Number of Other Schools
                         Sampled for:                      Sampled for:
State                    4th Only  8th Only 4th & 8th    4th Only  8th Only
Alabama                         2         2         0         113       105
Arizona                         6         6         0         106       102
Arkansas                        7         3         0         119        98
California                      8         0         c         108       105
Colorado                       15         8         0         114       105
Connecticut                     1         0         0         114       101
Delaware                        0         0         2          50        24
District of Columbia            3         0         0         108        28
Florida                         1         0         0         106       105
Georgia                         0         0         0         109       104
Guam                            0         0         0          21         6
Hawaii                          1         2         0          94        43
Idaho                          15         8         0         115        74
Indiana                         2         0         0          Hi       10S
Iowa                           14         2         0         129       106
Kentucky                        4         5         0         122       105
Louisiana                       7         0         0         111       108
Maine                          40        10         0         122        87
Maryland                        2         0         0         109       104
Massachusetts                   1         0         0         122       105
Michigan                        2         0         0         114       105
Minnesota                       4         0         0         116       104
Mississippi                     1         0         0         110       102
Missouri                       14         6         0         116       101
Nebraska                       79        42         0         121        79
New Hampshire                  25         3         0         115        75
New Jersey                      4         1         0         119       106
New Mexico                     12         7         0         110        86
New York                        1         2         0         109       105
North Carolina                  4         1         0         113       105
North Dakota                   45        23         0         118        55
Ohio                            2         0         0         116       105
Oklahoma                       26        12         0         118        98
Pennsylvania                    0         0         0         118       104
Rhode Island                    0         0         1         108        47
South Carolina                  2         0         0         111       105
Tennessee                       7         1         0         115       105
Texas                           5         2         0         109       105
Utah                            6         1         0         106        87
Virginia                        3         0         0         111       106
Virgin Islands                  1         0         0          22         5
West Virginia                  26         1         0         140       106
Wisconsin                      14         2         0         120       105
Wyoming                        61         3        12         120        47



0 

0 

0 

0 

0 

0 

2 

7 

0 

0 

0 



0 

0 

0 

3 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

6 

0 

0 

0 

0 

0 

1 

0 

0 

2 



*Small school denotes a school with fewer than 20 students. 
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Table 3-5 shows the overlap of fourth- and eighth-grade schools in participating states. 
Table 3-6 shows the number of schools selected in both the national and state assessments. 



3.5.6 New School Selection 

A district-level file was constructed from the aggregate of the fourth- and eighth-grade 
school frame. The file was divided into a small districts file, consisting of those districts in 
which there were at most two schools on the aggregate frame but no more than one fourth- and 
one eighth-grade school. The remainder of districts were denoted as "large" districts. 

All new eligible schools coming from "small" districts (those with at most one grade 4 
and one grade 8 school) that had a school selected in the regular sample for that grade were 
included in the sample and treated as belonging to the same cluster as the original selection 
from that district. 

A sample of "large" districts was drawn in each state. All districts were selected in 
Delaware, the District of Columbia, Guam, Hawaii, and the Virgin Islands. In the remaining 
states, the file of "large" districts (eligible for sampling) was divided into two files within 
each state. From the file of districts with combined enrollment of about 20 percent of the state 
enrollment, two districts were selected with equal probability. 

From the rest of the file, eight districts per state were selected with probability 
proportional to enrollment. The selected districts were then sent a listing of all their schools 
that appeared on the QED sampling frame, and were asked to provide information about the 
new schools not included in the QED frame. These listings, provided by selected districts, were 
used as sampling frames for selection of new schools. 

The eligibility of a school was determined based on the grade span. A school was 
classified as "new" if the school was eligible for sampling based on its grade span but not 
included in the QED frame, or if the changes of grade span were such that the school status 
changed from ineligible to eligible. The average grade enrollment for these schools was set to 
the average grade enrollment before the grade span change. The schools found eligible for 
sampling due to the grade span change were added to the corresponding grade frame. 

Similar to the main sample, we assigned the following measure of size to eighth-grade 
schools to produce self-weighting samples of students: 

$$
\text{measure of size} =
\begin{cases}
30 & \text{if eighth-grade enrollment} < 35 \\
\text{eighth-grade enrollment} & \text{otherwise}
\end{cases}
$$

The probability of selecting a school was 

$$
\min\left\{ \frac{\text{sampling rate} \times \text{measure of size}}{P(\text{district})},\ 1 \right\}
$$

where P(district) was the probability of selection of a district and the sampling rate was the rate 
used for the particular state in the selection of the original sample of schools. 
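
The measure of size and selection probability for new eighth-grade schools can be written out as a small Python sketch. The printed expression is garbled in this copy, so the form used here, the sampling rate times the measure of size, divided by the district probability and capped at 1, is an assumption, as are the function names and the example rates.

```python
def measure_of_size_grade8(enrollment):
    """Measure of size for an eighth-grade school: 30 below 35 students,
    otherwise the eighth-grade enrollment itself."""
    return 30 if enrollment < 35 else enrollment


def new_school_probability(enrollment, sampling_rate, p_district):
    """Assumed selection probability of a new school within a sampled district."""
    return min(sampling_rate * measure_of_size_grade8(enrollment) / p_district, 1.0)


# Hypothetical values: a state sampling rate of 0.002 per unit of size.
small = new_school_probability(enrollment=20, sampling_rate=0.002, p_district=0.5)
large = new_school_probability(enrollment=400, sampling_rate=0.002, p_district=0.4)
```

Dividing by the district probability offsets the first-stage selection of the district, so that a new school's overall chance of selection is in line with that of schools in the original sample.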
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Table 3-6 

Number of Schools Selected for Both National and State Samples, by State 



State                    Both Grade 4   Both Grade 8   Different Grades   Total
Alabama                             0              1                  2       3
Arizona                             0              1                  3       4
Arkansas*                           1              2                  5       7
California                          0              0                  0       0
Colorado                            0              0                  0       0
Connecticut                         0              4                  3       7
Delaware                            0              0                  0       0
District of Columbia                0              1                  0       1
Florida                             0              2                  1       3
Georgia                             0              1                  0       1
Hawaii                              0              0                  0       0
Idaho                               0              0                  0       0
Indiana                             0              2                  1       3
Iowa                                0              1                  0       1
Kentucky                            0              3                  1       4
Louisiana                           0              2                  7       9
Maine                               0              0                  0       0
Maryland                            0              2                  0       2
Massachusetts                       0              0                  0       0
Michigan                            0              0                  0       0
Minnesota                           0              1                  1       2
Mississippi                         0              0                  0       0
Missouri                            0              1                  0       1
Nebraska*                           0              8                  5      12
New Hampshire                       0              0                  0       0
New Jersey                          0              0                  1       1
New Mexico                          0              4                  4       9
New York                            1              1                  0       1
North Carolina                      0              0                  0       0
North Dakota                        0              0                  0       0
Ohio                                0              0                  1       1
Oklahoma                            0              1                  1       2
Pennsylvania                        0              1                  2       3
Rhode Island                        0              3                  0       3
South Carolina                      0              1                  1       2
Tennessee                           0              0                  2       2
Texas                               0              0                  0       0
Utah                                0              0                  0       0
Virginia                            0              0                  0       0
West Virginia                       0              2                  2       4
Wisconsin                           0              2                  2       4
Wyoming                             1              3                  3       7
TOTAL                               3             50                 48     101

*One school was selected for State Grade 8 and National Grades 8 and 12. 
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The probability of selection of a school for the fourth-grade sample was the same as the 
eighth grade, with the measure of size for the fourth-grade schools being 

$$
\text{measure of size} =
\begin{cases}
60 & \text{if enrollment} \le 70 \\
\text{enrollment} & \text{if enrollment} > 70
\end{cases}
$$



The selection of the fourth- and eighth-grade samples was independent; thus, no selection 
probability adjustment was needed from one grade selection to the other. In each state, the 
sampling rates used for the main sample of fourth- and eighth-grade schools were used to select 
the new schools for the fourth- and eighth-grade samples, respectively. 

Tables 3-7 and 3-8 show the number of new schools coming from the "large" and "small" 
districts for the fourth- and eighth-grade samples. 



3.5.7 Assigning Subject Session Types at Grade 4 

In the interest of sampling efficiency it was desirable that each of the two subjects 
assessed at grade 4, reading and mathematics, be administered in as large a subset of the 
sampled schools as possible. On the other hand it was unreasonable to expect very small schools 
to conduct two different sessions with half of the eligible students in each. To satisfy these two 
requirements the following procedure was used. 

If, according to the information on the frame, the school had an enrollment of 21 or 
more grade 4 students, the school was assigned initially to conduct both mathematics and 
reading sessions, with half of the selected students being assigned to a mathematics session, and 
half to a reading session (see section 3.6 for a description of the student sampling process). 

This varied only in Guam, where all students took both assessment types. 

If, according to the frame data, the school enrollment was 20 or fewer, the school was 
randomly assigned to conduct either a mathematics or a reading session. The assignment was 
systematic, based on the ordering of the clusters for sample selection, with random ordering of 
selected schools within clusters. 

If a school had two session types assigned initially, but was found at the time of drawing 
the student samples to have fewer than 21 eligible students, the school was randomly assigned to 
conduct only one of the two session types, with each type being chosen with probability 0.5. 
This assignment was independent from school to school. Thus a school was to conduct a single 
session type if either its frame or its actual enrollment for grade 4 was 20 or fewer; a school was 
to conduct both session types if both its frame and actual enrollments exceeded 20. 
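
The assignment rule for grade 4 session types can be summarized in a short Python sketch. The function name and arguments are hypothetical. In particular, the `position` argument stands in for the systematic assignment of small schools along the sorted list, here simplified to strict alternation, which is an assumption. The Guam exception, where all students took both assessment types, is not modeled.

```python
import random

SUBJECTS = ("mathematics", "reading")


def assign_session_types(frame_enrollment, actual_enrollment, position=0, rng=None):
    """Return the session types a sampled grade 4 school is to conduct."""
    rng = rng or random.Random()
    if frame_enrollment <= 20:
        # Small on the frame: one subject, assigned systematically
        # (alternation by list position is an assumption of this sketch).
        return (SUBJECTS[position % 2],)
    if actual_enrollment <= 20:
        # Initially assigned both, but too few eligible students were found:
        # one subject is chosen with probability 0.5 each.
        return (rng.choice(SUBJECTS),)
    return SUBJECTS  # both frame and actual enrollment exceed 20


both = assign_session_types(50, 45)
small_on_frame = assign_session_types(15, 30, position=1)
shrunk = assign_session_types(30, 18, rng=random.Random(0))
```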



3.5.8 Designating Monitor Status 

Within each state, random equivalent half samples of schools were assigned to be 
monitored or unmonitored. The details of the implementation of the monitoring process in the 
Table 3-7 

Distribution of New Schools Coming from "Large" and "Small" Districts in the Fourth-grade Sample 

                      Number of New Schools 
State                 "Large" Districts   "Small" Districts 

Alabama                                   - 
Arizona               -                   - 
Arkansas              1                   - 
California            2                   1 
Colorado              2                   - 
Connecticut           -                   - 
Delaware              3                   - 
District of Columbia  1                   - 
Florida               5                   - 
Georgia               1                   - 
Guam                  -                   - 
Hawaii                1                   - 
Idaho                 -                   - 
Indiana               1                   - 
Iowa                  -                   - 
Kentucky              2                   - 
Louisiana             -                   - 
Maine                 -                   - 
Maryland              2                   - 
Massachusetts         -                   1 
Michigan              1                   - 
Minnesota             1                   - 
Mississippi           1                   - 
Missouri              1                   - 
Nebraska              -                   - 
New Hampshire         -                   - 
New Jersey            1                   - 
New Mexico            -                   - 
New York              -                   - 
North Carolina        4                   - 
North Dakota          1                   - 
Ohio                  5                   - 
Oklahoma              -                   - 
Pennsylvania          1                   - 
Rhode Island          -                   - 
South Carolina        1                   - 
Tennessee             2                   - 
Texas                 -                   - 
Utah                  -                   1 
Virginia              5                   - 
Virgin Islands        -                   - 
West Virginia         1                   - 
Wisconsin             1                   - 
Wyoming               -                   1 
Table 3-8 

Distribution of New Schools Coming from "Large" and "Small" Districts in the Eighth-grade Sample 

                      Number of New Schools 
State                 "Large" Districts   "Small" Districts 



Alabama 

Arizona 

Arkansas 

California 

Colorado 

Connecticut 

Delaware 

District of Columbia 

Florida 

Georgia 

Guam 

Hawaii 

Idaho 

Indiana 

Iowa 

Kentucky 

Louisiana 

Maine 

Maryland 

Massachusetts 

Michigan 

Minnesota 

Mississippi 

Missouri 

Nebraska 

New Hampshire 

New Jersey 

New Mexico 

New York 

North Carolina 

North Dakota 

Ohio 

Oklahoma 
Pennsylvania 
Rhode Island 
South Carolina 
Tennessee 
Texas 
Utah 
Virginia 
Virgin Islands 
West Virginia 
Wisconsin 
Wyoming 






















field are given in Chapter 4. The purpose of monitoring a random half of the schools was to 
ensure that the procedures were being followed throughout each state by the school and district 
personnel administering the assessments, and to provide data adequate for assessing whether 
there was a significant difference in assessment results between monitored and unmonitored 
schools within each state. 

The following procedure was used to determine the sample of schools to be monitored. 
The initially selected clusters were sorted in the order in which they were systematically selected 
(see sections 3.5.2 to 3.5.4). New schools from "large" districts added to the sample (see section 
3.5.6) were treated as single school clusters, and were added to the end of the list in random 
order. The sorted clusters were then paired, and one member of each pair was assigned at 
random, with probability 0.5, to be monitored. The assignment was independent across pairs. If 
there was an odd number of clusters, the last cluster was assigned, with probability 0.5, to be 
monitored. 

If a cluster was designated to be monitored, all selected schools within the cluster (after 
thinning of small schools from multiple school clusters in Cluster Type 2 states; see section 
3.5.3) were assigned to be monitored. For the grade 4 samples, this procedure, in combination 
with the procedure for assigning schools to subjects (see section 3.5.7), ensured that for every 
pair of clusters for each subject at least one school would be monitored and at least one would 
not. 
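
The pairing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the representation of clusters as a simple ordered list of identifiers are hypothetical, and the list is assumed to already be in selection order with any new "large"-district schools appended at the end in random order.

```python
import random


def assign_monitor_status(ordered_clusters, rng):
    """Map each cluster id to True (monitored) or False (unmonitored).

    Consecutive clusters are paired; within each pair one member is chosen
    with probability 0.5 to be monitored and the other is unmonitored.
    A leftover cluster (odd count) is monitored with probability 0.5.
    """
    monitored = {}
    # Walk the list two clusters at a time; each pair is decided independently.
    for i in range(0, len(ordered_clusters) - 1, 2):
        first, second = ordered_clusters[i], ordered_clusters[i + 1]
        first_is_monitored = rng.random() < 0.5
        monitored[first] = first_is_monitored
        monitored[second] = not first_is_monitored
    if len(ordered_clusters) % 2 == 1:
        # Odd number of clusters: the last one is assigned on its own.
        monitored[ordered_clusters[-1]] = rng.random() < 0.5
    return monitored
```

Because every pair contributes exactly one monitored and one unmonitored cluster, the monitored and unmonitored groups form roughly equal halves that are balanced along the original sort order.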



In the territories of Guam and the Virgin Islands, there were few schools in each sample, 
and large samples of students (that all of the students enrolled) were drawn from each school. 
In these jurisdictions the monitoring assignment was done at the level of the physical assessment 
session, rather than at the cluster level. After establishing in each school the number of sessions 
to be conducted, alternate sessions were designated to be monitored, with the first session 
assigned at random. Thus all schools contained some monitored and some unmonitored 
sessions. 



3.5.9 Substitutes 

A substitute school was selected for each selected school containing eligible students, for 
which school nonparticipation was established by the state coordinator as of November 1, 1991. 
The process of selecting a substitute for a school involved identifying the most similar school in 
terms of the following characteristics: urbanization, percent Black enrollment, percent Hispanic 
enrollment, fourth-grade (or eighth-grade, as applicable) enrollment, and median income. To 
identify candidates for substitution, a set of schools was found that provided reasonable matches 
with regard to fourth/eighth-grade enrollment, and percent Black and Hispanic enrollment. 
From among this set a match was selected, considering all five characteristics. Schools in the 
National Assessment sample and those in the Prospects study were avoided in the selection of 
substitutes, where possible. Furthermore, the substitute was selected from the same district, 
wherever possible, to avoid placing the burden of replacing a refusing school from one district 
on another district. This was often not possible, however, as in the majority of cases the 
decision not to participate was made at the district level. 

In the cases where no suitable substitute could be found among those schools not 
sampled (most often because all or most schools were included in the original sample), a school 
already in the sample conducted a double session, of which one session served as a substitute for 
students in the refusing school. The same criteria were applied in selecting the schools that 
conducted double sessions; that is, a reasonable match was found based on grade enrollment, 
percent of Black and Hispanic enrollment, median income, and urbanization. 
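
The two-stage matching described in this section (first screen on enrollment and percent Black and Hispanic enrollment, then choose among the survivors using all five characteristics) might look like the sketch below. The report does not state tolerances or a distance measure, so the `enroll_tol` and `pct_tol` thresholds, the numeric urbanization code, and the unweighted distance are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class School:
    name: str
    urbanization: int       # hypothetical numeric category code
    pct_black: float
    pct_hispanic: float
    grade_enrollment: int
    median_income: float


def find_substitute(refusing, pool, enroll_tol=0.25, pct_tol=10.0):
    """Return the most similar school in pool, or None if no reasonable match."""
    # Stage 1: keep schools that roughly match on grade enrollment and on
    # percent Black and percent Hispanic enrollment (assumed tolerances).
    candidates = [
        s for s in pool
        if abs(s.grade_enrollment - refusing.grade_enrollment)
        <= enroll_tol * refusing.grade_enrollment
        and abs(s.pct_black - refusing.pct_black) <= pct_tol
        and abs(s.pct_hispanic - refusing.pct_hispanic) <= pct_tol
    ]
    if not candidates:
        return None

    # Stage 2: rank the candidates on all five characteristics
    # (assumed unweighted sum of scaled differences).
    def distance(s):
        return (
            (s.urbanization != refusing.urbanization)
            + abs(s.pct_black - refusing.pct_black) / 100
            + abs(s.pct_hispanic - refusing.pct_hispanic) / 100
            + abs(s.grade_enrollment - refusing.grade_enrollment)
            / max(refusing.grade_enrollment, 1)
            + abs(s.median_income - refusing.median_income)
            / max(refusing.median_income, 1)
        )

    return min(candidates, key=distance)
```

In practice the pool would first be restricted as the text describes: schools in the National Assessment sample or the Prospects study would be avoided, and schools from the same district would be preferred where possible.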

Tables 3-9 and 3-10 include information about the number of substitutes provided for 
each grade and in each state. Of the 44 states participating, 27 were provided with at least one 
substitute. Among states receiving no substitutes, the majority had 100 percent participation 
from the original sample. In a few cases, however, refusals did occur after the November 1 
deadline. The number of substitutes provided to a state ranged from 0 to 59 in the fourth grade 
and 0 to 43 in the eighth grade. A total of 591 substitutes were selected for the fourth-grade 
sample, 23 of which were double session substitutes. A total of 460 substitutes were selected for 
the eighth-grade sample, 75 of which were double session substitutes. Some states did not 
attempt to solicit participation from the substitute schools provided, as they considered the 
timing too late to seek cooperation from schools not previously notified about the assessment. 
In quite a few cases the originally selected school agreed to cooperate after a substitute was 
selected and had agreed to participate (in which case the substitute school data were discarded). 

Tables 3-11 and 3-12 show the number of schools in the fourth- and eighth-grade 
samples for the mathematics assessment, together with school response rates observed within 
participating states. Refer to the Trial State Assessment report entitled School and Student 
Participation Rates for the Mathematics Assessment and Guidelines for Sample Participation, 
September 1992, for an analysis of participation rates. The tables also show the number of 
substitutes in each state that were associated with a nonparticipating original school selection, 
and the number of those that participated. 



3.6 STUDENT SAMPLE SELECTION 

Schools initially sent a complete list of students to a central location in November 1991. 
Schools were not asked to list students in any particular order, but were asked to implement 
checks to ensure that all fourth/eighth-grade students were listed. Based on the total number of 
students on this list, called the Student Listing Form, sample line numbers were generated for 
student sample selection. To generate these line numbers, the sampler entered the number of 
students on the form and the number of mathematics and reading sessions into a calculator that 
had been programmed with the sampling algorithm. The calculator generated a random start 
that was used to systematically select the student line numbers (30 per session). To compensate 
for new enrollees not on the Student Listing Form, extra line numbers were generated for a 
supplemental sample of new students. All students were selected in those schools with grade 
enrollment size of up to 10 percent more than the required sample size of students. This 
sample design was intended to give each student within the state approximately the same chance 
of selection. 
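
The line-number selection just described is a systematic sample with a random start. The report does not reproduce the algorithm programmed into the calculator, so the following is only a sketch of a standard fractional-interval systematic sample; the function name and the interval arithmetic are assumptions, and the supplemental sample of new enrollees is not modeled.

```python
import math
import random

STUDENTS_PER_SESSION = 30


def select_line_numbers(n_listed, n_sessions, rng):
    """Return the 1-based Student Listing Form line numbers to sample."""
    target = STUDENTS_PER_SESSION * n_sessions
    # Take everyone when enrollment is at most 10 percent above the target
    # (integer arithmetic avoids floating-point rounding at the boundary).
    if 10 * n_listed <= 11 * target:
        return list(range(1, n_listed + 1))
    # Otherwise step through the list at a fixed interval from a random start.
    interval = n_listed / target
    start = rng.random() * interval
    return [math.floor(start + k * interval) + 1 for k in range(target)]
```

For a school listing 200 students with two sessions, this yields 60 distinct line numbers spread evenly through the list; a school listing 33 students with one session would have all 33 selected.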

The states where all schools were selected with certainty (Cluster Type 1 states) were 
treated differently. For the fourth-grade sample in Delaware and the District of Columbia, 120 
students were selected, where possible. If the enrollment was lower than 120, all of the students 
were taken. In the territories, all of the fourth-grade students were included in the sample. In 
the six states where all schools were selected at grade 8, up to 90 students were selected for the 
sample from each school, depending on the school size. 



Table 3-9 

Substitute School Counts for Grade 4 

                      Double Session    Regular 
State                 Substitutes       Substitutes       Total 

Alabama               2                 27                29 
Arkansas              0                 13                13 
California            0                 16                16 
Idaho                 0                 24                24 
Indiana               0                 28                28 
Kentucky              0                 3                 3 
Maine                 3                 53                56 
Maryland              0                 1                 1 
Massachusetts         0                 15                15 
Michigan              0                 20                20 
Minnesota             1                 15                16 
Mississippi           0                 2                 2 
Missouri              0                 9                 9 
Nebraska              0                 59                59 
New Hampshire         0                 42                42 
New Jersey            0                 53                53 
New Mexico            2                 32                34 
New York              0                 28                28 
North Carolina        0                 5                 5 
North Dakota          1                 46                47 
Ohio                  0                 27                27 
Oklahoma              0                 15                15 
Pennsylvania          0                 17                17 
Rhode Island          14                2                 16 
South Carolina        0                 2                 2 
Tennessee             0                 8                 8 
Texas                 0                 6                 6 
TOTAL                 23                568               591 
Table 3-10 

Substitute School Counts for Grade 8 

                      Double Session    Regular 
State                 Substitutes       Substitutes       Total 

Alabama               4                 36                40 
Arkansas              0                 10                10 
California            1                 13                14 
Idaho                 7                 6                 13 
Indiana               2                 19                21 
Kentucky              0                 3                 3 
Maine                 16                19                35 
Maryland              3                 7                 10 
Massachusetts         0                 12                12 
Michigan              0                 23                23 
Minnesota             3                 14                17 
Mississippi           0                 1                 1 
Missouri              0                 8                 8 
Nebraska              4                 30                34 
New Hampshire         4                 11                15 
New Jersey            0                 43                43 
New Mexico            17                7                 24 
New York              1                 23                24 
North Carolina        0                 4                 4 
North Dakota          1                 16                17 
Ohio                  0                 23                23 
Oklahoma              2                 15                17 
Pennsylvania          1                 24                25 
Rhode Island          8                 0                 8 
South Carolina        0                 4                 4 
Tennessee             1                 9                 10 
Texas                 0                 5                 5 
TOTAL                 75                385               460 
Table 3-11 

Distribution of the Grade 4 Mathematics School Sample by State 

                      Weighted Percent School     Number of Schools in the                  Number of Substitutes for   Total Number of 
                      Participation               Original Sample                           Nonparticipating Originals  Schools That 
                      Before        After                                                                               Participated 
State                 Substitution  Substitution  Total         Not Eligible  Participated  Provided      Participated 

Alabama               74.91         97.04         113           3             81            27            25            106 
Arizona               100.00        100.00        110           2             108           0             0             108 
Arkansas              89.65         99.13         123           2             109           11            11            120 
California            9130          9658          115           3             101           7             7             108 
Colorado              100.00        100.00        123           2             121           0             0             121 
Connecticut           99.03         99.03         115           4             110           0             0             110 
Delaware              92.15         92.15         56            6             44            0             0             44 
Dist. of Columbia     9832          9832          114           5             107           0             0             107 
Florida               100.00        100.00        111           1             110           0             0             110 
Georgia               100.00        100.00        110           2             108           0             0             108 
Guam                  94.18         94.18         21            0             20            0             0             20 
Hawaii                100.00        100.00        108           0             108           0             0             108 
Idaho                 8337          96.62         120           0             98            21            17            115 
Indiana               7631          91.06         118           2             88            26            17            105 
Iowa                  100.00        100.00        132           4             128           0             0             128 
Kentucky              92.85         95.65         124           3             115           3             3             118 
Louisiana             100.00        100.00        113           4             109           0             0             109 
Maine                 5651          7133          142           2             75            43            22            97 
Maryland              99.19         99.19         112           1             110           1             0             110 
Massachusetts         8651          96.64         123           4             103           12            11            114 
Michigan              83.04         8957          114           3             90            16            8             98 
Minnesota             8159          93.74         118           5             93            15            13            106 
Mississippi           98.09         100.00        111           2             107           2             2             109 
Missouri              892S          97.06         120           7             101           9             9             110 
Nebraska              7938          8735          157           6             109           36            11            120 
New Hampshire         68.72         8038          126           3             84            32            16            100 
New Jersey            7536          8134          119           3             88            22            7             95 
New Mexico            75.45         9038          116           2             86            26            18            104 
New York              77.74         8335          107           0             83            21            7             90 
North Carolina        95.15         99.09         118           2             111           5             5             116 
North Dakota          7336          8934          133           3             97            30            19            116 
Ohio                  78.63         9139          122           1             95            21            15            110 
Oklahoma              86.14         98.0 <        129           3             111           14            13            124 
Pennsylvania          8437          95.40         116           0             99            17            12            111 
Rhode Island          8332          96.16         115           5             90            15            15            105 
South Carolina        98.06         99.03         112           2             108           1             1             109 
Tennessee             9156          92.71         120           2             108           8             1             109 
Texas                 9339          9753          111           3             100           5             5             105 
Utah                  99.05         99.05         110           1             108           0             0             108 
Virginia              9859          98.99         116           4             111           0             0             111 
Virgin Islands        100.00        100.00        24            0             24            0             0             24 
West Virginia         100.00        100.00        147           6             141           0             0             141 
Wisconsin             100.00        100.00        127           5             122           0             0             122 
Wyoming               96.77         96.77         157           11            143           0             0             143 




Table 3-12 

Distribution of the Grade 8 Mathematics School Sample by State 



State 


Weighted Percent School 
Participation 


Number of Schools in the 
Original Sample 


Number of Substitute 
Schools for 

Nonparticipating Originals 


Total 
Number of 
Schools That 
Participated 


Before 
Substitution 


After 

Substitution 


Total 


Not 

Eligible 


Participated 


Provided 


Participated 

Alabama 


65.71 


9238 


107 


1 


70 


33 


28 


98 


Arizona 


98.73 


98.73 


109 


5 


103 


0 


0 


103 


Arkansas 


89.44 


9732 


101 


1 


89 


jR&SfaTT* 


8 


97 


California 


9334 


98.10 


107 


2 


98 




5 


103 


Colorado 


100.00 


100.00 


113 




112 




0 


m 


Connecticut 


99.02 


99.02 


101 








0 


97 


Delaware 


100.00 


100.00 


30 








0 


28 


Dist. of Columbia 


100.00 


100.00 


37 








0 


35 


Florida 


100.00 


100.00 


107 








0 


103 


Georgia 


99.02 


99.02 


106 




102 




0 


102 


Guam 


100.00 


100.00 


6 




6 


UHlt* 


0 


6 


Hawaii 


99.97 


99.97 


57 


5 


51 




0 


51 


Idaho 


84.78 


91.06 


82 


1 


67 


u 


6 


73 


Indiana 


7938 


93.69 


107 


0 


85 


20 


16 


101 


Iowa 


99.06 


99.06 


109 


3 


105 




0 


105 


Kentucky 


9635 


98.13 


112 


6 


102 




2 


104 


Louisiana 


100.00 


100.00 


109 


8 


101 




0 


101 


Maine 


62.15 


8430 


100 




60 


31 


20 


80 


Maryland 


89.41 


9134 


104 




93 


■1 


2 


95 


Massachusetts 


8332 


95.14 


109 




85 




12 


97 


Michigan 


7739 


94.40 


108 




83 


3 


18 


101 


Minnesota 


8139 


92.16 


104 




82 






93 


Mississippi 


9893 


100.00 


102 




98 




1 


99 


Missouri 


92.16 


99.02 


107 




98 


I 




105 


Nebraska 


75.19 


8533 


122 


10 


73 




■ | 


85 


New Hampshire 


79.86 


91.67 


78 


1 


62 


H 


9 


71 


New Jersey 


6936 


77.73 


108 


2 


75 


27 


9 


84 


New Mexico 


7735 


9356 


93 




65 




15 


84 


New York 


8057 


83.48 


108 




84 




3 


87 


North Carolina 


9430 


98.10 


108 




99 


1 


4 


103 


North Dakota 


7837 


96.78 


80 




S5 


16 


15 


70 


Ohio 


7731 


89.48 


110 


~ r .) tj, [TJ 


85 


20 


14 


99 


Oklahoma 


81.77 


9839 


110 


3 


88 


17 


17 


105 


Pennsylvania 


80.78 


9433 


107 


2 


84 




14 


98 


Rhode Island 


85.04 


99.66 


57 


5 


44 




7 


51 


South Carolina 


94.10 


97.17 


105 


0 


99 




3 


102 


Tennessee 


87.46 


9132 


106 


2 


91 


10 


4 


95 


Texas 


95.14 


99.03 


107 


3 


99 


5 


4 


103 


Utah 


100.00 


100.00 


88 


3 


85 




0 


85 


Virginia 


97.17 


97.17 




* 


103 




0 


103 


Virgin Islands 


100.00 


100.00 


« 


a— i 


6 




0 




West Virginia 


100.00 


j 100.00 






104 




0 




Wisconsin 


100.00 


100.00 


109 








0 






99.04 


99.04 


66 


y_ 




HHK 


0 





0*1 


































After the student sample was selected, the administrator at each school identified 
students who were incapable of taking the assessment because they were either disabled or 
unable to speak English. More details on the procedures for student exclusion are presented in 
the report on field procedures for the Trial State Assessment Program. 

When the assessment was conducted in a given school, a count was made of the number 
of nonexcluded students who did not attend the session. If this number exceeded three students, 
the school was instructed to conduct a make-up session, to which were invited all students who 
were absent from the initial session. 

Tables 3-13 and 3-14 provide the distribution of the fourth-grade and eighth-grade 
mathematics student samples and response rates by state. 
Table 3-13 

Distribution of the Grade 4 Mathematics Student Sample and Response Rates by State 












Table 3-14 

Distribution of the Grade 8 Mathematics Student Sample and Response Rates by State 

                      Weighted Student  Number of Students 
                      Response Rate     In Original     Excluded from   To Be           Actually 
State                 (Percent)         Sample          Sample          Assessed        Assessed 

Alabama               95.43                                             2,748           2322 
Arizona               92.71                                             2,812           2,617 
Arkansas              94.01             2,978           176             2,717           2356 
California            91.85             3,101           246             2,763           2316 
Colorado              93.18             3,199           136             3,006           2,799 
Connecticut           93.73             3,029           192             2,783           2,613 
Delaware              91.83             2320            97              2,098           1,934 
District of Columbia  84.95             2317            225             2,137           1,816 
Florida               90.67             3,073           199             2,812           2349 
Georgia               92.96             3,011           137             2,787           2389 
Guam                  89.74             1,734           72              1,667           1,496 
Hawaii                89.69             2,904           142             2,724           2,454 
Idaho                 94.59             2,936           91              2,799           2,615 
Indiana               94.29             3,000           140             2,820           2,659 
Iowa                  95.29             3,133           129             2,959           2,816 
Kentucky              95.52             3,087           135             2,883           2,756 
Louisiana             9234              3,028           120             2,794           2382 
Maine                 9332              2,838           124             2,698           2320 
Maryland              92.00             2,803           128             2,605           2399 
Massachusetts         9335              2,909           217             2,623           2,456 
Michigan              93.65             3,020           184             2,793           2,616 
Minnesota             9432              2,758           92              2,619           2,471 
Mississippi           94.75             2,958           207             2,636           2,498 
Missouri              94.97             2,984           128             2,815           2,666 
Nebraska              9531              2343            108             2392 
New Hampshire         9336              2,958           156             2,755           2382 
New Jersey            94.06             2306            169             2307            2,174 
New Mexico            92.98             3,041           163             2,780           2361 
New York              91.90             2381            193             2347            2,158 
North Carolina        9435              3,071           102             2,936           2,769 
North Dakota          95.89             2313            63              2,418           2314 
Ohio                  92.71             2,942           177             2,732           2335 
Oklahoma              79.99             2,934           184             2,710           2,141 
Pennsylvania          94.16             2,964           127             2,806           2,640 
Rhode Island          9232              2,481           119             2389            2,120 
South Carolina        9331              3,057           174             2308            2,625 
Tennessee             94.10             2,838           137             2,644           2,485 
Texas                 9336              3,048           205             2,794           2,614 
Utah                  93.74             3,124           141             2,910           2,726 
Virginia              9438              3,091           153             2,872           2,710 
Virgin Islands        9238              1,708           86              1,601           1,479 
West Virginia         94.48             3,097           178             2,843           2,690 
Wisconsin             93.68             3,165           130             3,002           2,814 
Wyoming               94.78             2,743           107             2376            2,444 
Chapter 4 

STATE AND SCHOOL COOPERATION AND FIELD ADMINISTRATION 



Nancy Caldwell 
Westat, Inc. 



4.1 OVERVIEW 

By volunteering to participate in the Trial State Assessment and in the field test that 
preceded it, each state assumed responsibility for securing the cooperation of the schools 
sampled by NAEP. The participating states were responsible for the actual administration of 
the 1992 Trial State Assessment at the school level. For the field test in 1991, however, 
individual states could choose to have NAEP administer the entire program. This chapter 
describes state and school cooperation and field administration procedures for both the field test 
and the 1992 program. Section 4.2 presents information on the field test in 1991, while section 
4.3 focuses on the 1992 Trial State Assessment. 



4.2 THE FIELD TEST 
4.2.1 Conduct of the Field Test 

In preparation for the 1992 state and national assessment programs, a field test of the 
forms, procedures, and booklet items was held in early 1991. The field test also gave states an 
opportunity to learn about their responsibilities for the new aspects of the Trial State 
Assessment. 

In June 1990, letters were sent from the U.S. Department of Education to all Chief State 
School Officers inviting them to participate in the field test of materials and procedures for 
1992. Since the fourth grade had not been assessed as part of the Trial State Assessment 
before, states were given the option of conducting the field test themselves at this grade. Only 
states that had not participated in the 1990 assessment at the eighth grade were given the option 
of conducting the field test themselves. Otherwise, NAEP staff were to conduct the field test. 

In an effort to secure the participation of more schools and to lessen the burden of 
participation on the states, ETS and Westat offered to perform all of the work involved, 
including communicating with school staff, sampling, and administering the assessment. 

Twenty-four jurisdictions decided to participate in the field test. Twenty-one of the 
jurisdictions decided to have NAEP administer all field test sessions. In these jurisdictions, the 
state coordinator secured the cooperation of the selected schools and then Westat contacted the 
schools, confirmed the schedule and arrangements, selected the student samples, and conducted 
the assessment sessions. Three states (Florida, Kentucky, and Wisconsin) chose to have 
school staff (assessment administrators) conduct the fourth-grade assessments. Westat 
conducted training sessions for the assessment administrators in these states. None of the 
jurisdictions elected to conduct the eighth-grade assessments themselves. 

Each participating jurisdiction was asked to appoint a state coordinator to secure the 
cooperation of sampled schools, and to be the liaison between NAEP/Westat staff and the 
participating schools. 

As described in Chapter 3, the state coordinator for each state was sent the names of 
approximately 30 pairs of selected schools and requested to secure the cooperation of one 
school from each pair. This process had been used successfully in the field test for 1990, and 
was again successful in the field test for 1992. In total, 664 schools agreed to participate in the 
field test; in 662 of these schools assessment sessions were conducted. 

As stated earlier, Florida, Kentucky, and Wisconsin chose to administer the new 
components of the assessment, fourth-grade reading and mathematics, in order to gain 
experience with the procedures planned for 1992. The rest of this section describes the 
procedures implemented in those three states. 

Although the three states were responsible for the actual administration at the school 
level, Westat was responsible for developing the administration materials and procedures and 
for training state staff. Two training sessions were conducted by Westat home office staff in 
each of the three states during mid-January. All assessment administrators received a manual 
before attending one of these training sessions. The training program consisted of a video 
presentation, scripted lecture and training exercises. 

In January 1991, Westat field supervisors selected the student sample for each school 
and prepared an Administration Schedule (roster) of the sampled students. The Administration 
Schedule was sent by the state coordinator to the school two weeks before the scheduled 
assessment date. The other assessment materials were shipped by NCS to arrive two weeks 
before the scheduled assessment date. Upon receiving the Administration Schedule and the 
assessment materials, the assessment administrator followed NAEP procedures to select an 
additional sample of newly enrolled students, identify students who were not capable of 
participating in the assessment, and prepare assessment questionnaires. 

On assessment day, the field supervisor observed the assessment and queried the 
assessment administrator about the session, procedures, and materials. Supervisors used an 
Observation Form to record information about the major events related to the assessment and 
the assessment administrators’ opinions and comments. 




4.2.2 Results of the Field Test 



The overall desired student participation level for the field test was determined from the 
goal of obtaining 300 student responses for each item to be used in the national assessment and 
500 student responses for each item to be used in the Trial State Assessment. Depending on the 
size of the school, the school’s sample numbered approximately 30 to 60 students, who were 
assessed in either one or two sessions. 

Given these goals, the overall desired student participation in both the national and Trial 
State components of the field test was 22,600 students. In actuality, 24,910 students, or about 10 
percent more than required, were assessed. 
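
As a quick check on the figures quoted above, the excess over the participation target can be computed directly. The variable names are illustrative only; the two counts are the ones given in the text.

```python
# Field test participation: desired versus actual number of students assessed.
required = 22_600
assessed = 24_910

# Percentage by which the assessed count exceeded the target.
excess_pct = 100 * (assessed - required) / required
print(f"{assessed - required} additional students, {excess_pct:.1f}% above the target")
```

The result, 2,310 additional students or roughly 10.2 percent, is consistent with the "about 10 percent more than required" stated in the text.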

The field testing of materials and procedures at the fourth-grade level for the Trial State 
Assessment in the three states provided useful information for NAEP staff in preparation for 
1992. While the sessions went well and 80 to 90 percent of assessment administrators thought 
that the training session, the manuals, and the assessment materials worked well, the 
administrators did make many suggestions for improving these materials and procedures for the 
1992 assessment program. 



4.3 THE 1992 TRIAL STATE ASSESSMENT 

Forty-one states, the District of Columbia, and two territories volunteered for the 1992 
Trial State Assessment. This is a net increase of four jurisdictions over 1990, with seven newly 
participating in 1992 and three that were in the 1990 assessment deciding not to participate in 
1992. Figure 4-1 identifies the jurisdictions participating in each of the two assessment years. 
As with the field test, each jurisdiction designated a state coordinator to oversee all assessment 
activities in the state. 

Two states — Illinois and Washington — had agreed to participate in the 1992 Trial State 
Assessment, but dropped out before the assessment began, primarily due to a lack of success in 
getting schools in their states to participate. This followed a letter from NCES recommending 
that states obtain at least a 70 percent school cooperation rate in order to meet the guidelines 
for participation. 



4.3.1 Overview of Responsibilities 

The data collection for the 1992 Trial State Assessment involved a collaborative effort 
between the participating states and the NAEP contractors, especially Westat, the field 
administration contractor. Westat’s responsibilities included 

• selecting the sample of schools and students for each participating state; 

• developing the administration procedures and manuals; 




Figure 4-1 

Participating Jurisdictions, 1990 and 1992 Trial State Assessments 
• training the state personnel who would conduct the assessments; and 

• conducting an extensive quality assurance program. 

Each jurisdiction volunteering to participate in the 1992 program was asked to appoint a 
state coordinator. In general, the state coordinator was the liaison between NAEP/Westat staff 
and the participating schools. In particular, the state coordinator was asked to 

• gain the cooperation of the selected schools; 

• assist in the development of the assessment schedule; 

• receive the lists of all grade eligible students from the schools; 

• coordinate the flow of information between the schools and the NAEP 
contractors; 

• provide space for the state supervisor to use when sampling; 

• notify assessment administrators about training and send them their manuals; and 

• send the lists of sampled students to the schools. 

At the local school level, an assessment administrator was responsible for preparing for 
and conducting the assessment session(s) in one or more schools. These individuals were 
usually school or district staff and were trained by Westat staff. The assessment administrator's 
responsibilities included 

• receiving the list of sampled students from the state coordinator; 

• identifying sampled students who should be excluded; 

• distributing assessment questionnaires to appropriate school staff; 

• notifying sampled students and their teachers; 

• administering the assessment session; 

• completing assessment forms; and 

• preparing the assessment materials for shipment. 

Westat hired and trained six field managers and 44 state supervisors, one for each 
jurisdiction. Each field manager was responsible for working with the state coordinators of 
seven to eight states and for overseeing assessment activities. The primary tasks of the field 
managers were to 

• obtain information about cooperation and scheduling; 

• make sure the arrangements for the assessments were set and assessment 
administrators identified; and 

• schedule the assessment administrator training sessions. 

The primary tasks of the state supervisors were to 

• select the sample of students to be assessed; 

• conduct in-person assessment administrator training sessions; and 

• coordinate the monitoring of the assessment sessions and makeup sessions. 

Westat also hired and trained an average of eight quality control monitors in each state 
to monitor 50 percent of the assessment sessions. 



4.3.2 Schedule of Data Collection Activities 



May 15, 1991            Westat sent the samples of schools selected for the National and
                        Trial State Assessment to the state coordinators.

Early August, 1991      Westat field managers visited each state to explain the
                        computerized State Coordinator System, which could be used to
                        keep track of assessment-related activities.

                        Westat distributed Student Listing Forms, Principal
                        Questionnaires, and the list of the schools selected for the
                        Trial State Assessment, updated with a suggested week of
                        assessment and number and type of sessions.

May-November, 1991      State coordinators obtained cooperation from districts and
                        schools. State coordinators reported participation status to
                        Westat field managers via printed lists or computer files.

                        State coordinators sent Student Listing Forms, Supplemental
                        Student Listing Forms, and Principal Questionnaires to
                        participating schools.

October-                Westat selected substitutes for refusals and sent them to state
November, 1991          coordinators. States reporting the participation status of all
                        schools by October 15 received substitutes for refusals by
                        October 31. States reporting by October 31 received substitutes
                        by November 15.

November 14-17, 1991    State supervisor training sessions were held.

December 2-20, 1991     NAEP state supervisors visited state coordinators to select
                        student samples and prepare Administration Schedules listing the
                        students selected for each session.

                        Westat provided the schedule of training sessions and copies of
                        the Manual for Assessment Administrators to state coordinators
                        for distribution.

December 2, 1991-       State coordinators notified assessment administrators of the
January 10, 1992        date and time of training and sent each a copy of the Manual for
                        Assessment Administrators.

January 3-10, 1992      Quality control monitor training sessions were held.

January 9-31, 1992      Assessment administrator training sessions were held.

January 20-             State coordinators sent Administration Schedules to each school
February 14, 1992       two weeks before the scheduled assessment date.

February 3-28, 1992     Assessments were conducted. Unannounced visits were made by
                        quality control monitors to a predetermined 50 percent of the
                        sessions.

March 2-6, 1992         Makeup sessions were held as necessary.



4.3.3 Preparations for the Trial State Assessment 

The focal point of the schedule for the Trial State Assessment was the period between 
February 3-28, 1992, when the assessments were conducted in the schools. However, as with any 
undertaking of this magnitude, the project required many months of planning and preparation. 

Westat selected the samples of fourth- and eighth-grade schools according to the 
procedures described in Chapter 3. On May 15, 1991, lists of these selected schools and other 
materials describing the Trial State Assessment Program were sent to state coordinators. This 
mailing took place about two months earlier than for the 1990 assessment because state 
coordinators had requested more time to contact districts and schools and schedule the 
assessments. Most state coordinators also preferred that NAEP provide a suggested assessment 
date for each school. School listings were updated with this information and were sent to the 
state coordinators, along with other descriptive materials and forms, in early August. 

State coordinators also were given the option of receiving the school information in the 
form of a computer database with accompanying management information software. This 
system enabled the state coordinators to keep track of the cooperating schools, the assessment 
schedule, the training schedule, and the assessment administrators. Coordinators could choose 
to receive a laptop computer and printer or to have the system installed on their own computer. 
Westat field managers traveled to the state offices to explain the computer system to the state 
coordinators and their staff. All but one state coordinator chose to receive the system. 

Six of the most experienced NAEP supervisors were chosen to be field managers, the 
primary link between NAEP and the state coordinators. In mid-August, the field managers 
visited offices of the state coordinators to explain the computerized system to state staff. The 
field managers kept in frequent contact with the state coordinators as the state coordinators 
secured the cooperation of the selected schools and established the assessment schedule. 

The field managers used the same computer system as the state coordinators to keep 
track of the schools and schedule. The state coordinators sent updates by computer 
disk, by telephone, or in print to their field manager, who then entered the information into 
the system. Weekly transmissions were made from the field manager to Westat. 

The state coordinators’ first task was to secure the participation of the selected schools. 
States that had determined the cooperation status of all selected schools by October 15 were 
sent a list of potential replacements for refusals by October 31. States that reported by October 
31 received a list of potential substitutes by November 15. Both printed lists and computer files 
of substitute schools were transmitted to the field managers and state coordinators. (See 
Chapter 3 for more details about school substitution.) 

In mid-November, Westat hired one state supervisor for each participating state. The 
state supervisors attended a training session held in the Washington, DC, area between 
November 14-17, 1991. This training session focused on the state supervisors’ immediate 
tasks — selecting the student samples and hiring quality control monitors. State supervisors also 
were given the training script and materials for the assessment administrators' training sessions 
they would conduct in January so they could begin to become familiar with these materials. 

The state supervisors’ first task after training was to complete the selection of the sample 
of students who were to be assessed in each school. All participating schools were asked to send 
a list of their grade-eligible students to the state coordinator by November 15. Sample selection 
activities were conducted in the state coordinator’s office unless the state coordinator preferred 
that the lists be taken to another location. 

Using a preprogrammed calculator, the supervisors generally selected a sample of 30 
students per session type per school. The exceptions to this were small schools and states with 
fewer than the necessary 100 eighth-grade or 125 fourth-grade schools. In the states with fewer 
schools, larger student samples were required from schools that participated. 
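
The selection step can be illustrated with a short sketch. It assumes simple random sampling without replacement and assumes that a school with 30 or fewer listed students contributes all of them. The report states only that a preprogrammed calculator was used, so the function name, the seed parameter, and the small-school rule below are illustrative and not the documented procedure.

```python
import random


def select_student_sample(eligible_students, sample_size=30, seed=None):
    """Draw an equal-probability sample of students for one session type.

    Assumption: simple random sampling without replacement. The report does
    not name the method programmed into the supervisors' calculators.
    """
    students = list(eligible_students)
    if len(students) <= sample_size:
        # Assumed small-school rule: take every listed student.
        return students
    rng = random.Random(seed)
    return sorted(rng.sample(students, sample_size))
```

Passing a fixed seed makes a selection reproducible, which is convenient when a sample has to be re-derived for checking.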

After the sample was selected, the supervisor completed an Administration Schedule for 
each session, listing the students to be assessed. The Administration Schedules for each school 
were put into an envelope and given to the state coordinator to send to the school two weeks 
before the scheduled assessment date. Included in the envelope were instructions for sampling 
students who had enrolled at the schools since the creation of the original list used in sampling. 

During the period from mid-November through December, the state supervisors also 
recruited and hired quality control monitors to work in their states. It was the quality control 
monitor’s job to observe the sessions designated to be monitored, complete an observation form 
on each session, and to intervene when the correct procedures were not followed. In each state, 
half of the sessions were designated to be monitored. This information was known only to 
contractor staff; it was not on any of the listings provided to state staff. 

Approximately 400 quality control monitors were trained in two training sessions held 
during January 3-7 and 7-10, 1992. The first day of each training session was devoted to a 
presentation of the assessment administrators' training program by the state supervisors, which 
not only gave the quality control monitors an understanding of what assessment administrators 
were expected to do, but also gave state supervisors an opportunity to practice presenting the 
training program. The remaining days of the training sessions were spent reviewing the quality 
control monitor observation form and the role and responsibilities of the quality control 
monitors. 

Almost immediately after the quality control monitor training sessions, the supervisors 
began conducting the assessment administrator training sessions. Each quality control monitor 
attended several of these training sessions, to assist the state supervisor and to become 
thoroughly familiar with the assessment administrator's responsibilities. Almost 10,000 persons 
who were to be assessment administrators were trained in about 500 training sessions across the 
nation. 



To ensure uniformity in the training sessions, Westat developed a highly structured 
program involving a script for trainers, a videotape, and a training example to be completed by 
the trainees. The supervisors were instructed to read the script verbatim as they proceeded 
through the training, ensuring that each trainee received the same information. The script was 
supplemented by the use of overhead transparencies, displaying the various forms that were to 
be used and enabling the trainer to demonstrate how they were to be filled out. 

The videotape, similar to the one used in the 1990 Trial State Assessment, was 
developed by Westat to provide background for the study and to simulate the various steps of 
the assessment that would be repeated by the assessment administrators. The portions of the 
videotape depicting the actual assessment had been taped in a classroom with students in 
attendance to closely simulate an actual assessment session. The videotape was divided into 
sections with breaks for review by the trainer and practice for the trainees. 

The final component of the presentation was the "Training Example." This consisted of 
a set of exercises keyed to each part of the training package. A portion of the videotape was 
shown and then reviewed by the trainer following the script. Then, exercises related to that 
material were completed by the trainees before the next subject was discussed. 

The entire training session generally ran for about three and one-half hours. Sessions 
usually began in the morning and ended with lunch. In 1990, the training sessions had generally 
lasted about five to six hours. Responding to requests from state coordinators and assessment 
administrators, Westat trimmed the training session to one-half day. 

All of the information presented in the training session was included in the Manual for 
Assessment Administrators, developed by Westat. There were two versions of the manual, one for 
each grade. Copies of the manuals were sent by Westat to the state coordinators at the 
beginning of December so that they could be distributed to the assessment administrators before 
the training sessions. 



4.3.4 Monitoring of Assessment Activities 

Two weeks prior to the scheduled assessment date, the assessment administrator 
received the Administration Schedule and assessment questionnaires and materials. Five days 
before the assessment, the quality control monitor made a call to the assessment administrator 
and recorded the results of the call on the Observation Form. Most of the questions asked in 
the pre-assessment call were designed to gauge whether the assessment administrator had 
received all materials needed and was prepared for the session. 

Pre-assessment calls were made to all schools regardless of whether they were to be 
monitored. If the sessions in a school were not observed, the quality control monitor called the 
assessment administrator three days after the assessment to find out how the session went, to 
obtain the assessment administrator's impressions of the manual, training, and materials, and to 
ensure that all post-assessment activities had been completed. 

If the sessions in a school were to be monitored, the quality control monitor was to 
arrive at the school one hour before the scheduled beginning of the assessment to observe 
preparations for the assessment. To ensure the confidentiality of the assessment items, the 
booklets were packaged in shrink-wrapped bundles and were not to be opened until the quality 
control monitor arrived or 45 minutes before the session began, whichever occurred first. 

In addition to observing the opening of the bundles, the quality control monitor used the 
Observation Form to check that the following had been done correctly: sampling newly enrolled 
students, reading the script, distributing and collecting assessment materials, timing the booklet 
sections, answering questions from students, and preparing assessment materials for shipment. 

After the assessment was over, the quality control monitor obtained the assessment 
administrator’s opinions of how the session went and how well the materials and forms worked. 

If four or more students were absent from the session, a makeup session was to be held. 
If the original session had been monitored, the makeup session was also monitored. This 
required coordination of scheduling between the quality control monitor and assessment 
administrator. 
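
The makeup rule stated above reduces to two conditions, sketched here for illustration. The function name and the tuple it returns are hypothetical; only the threshold of four absent students and the carry-over of monitoring come from the text.

```python
def plan_makeup(absent_count, original_monitored, threshold=4):
    """Return (makeup_needed, makeup_monitored) for one assessment session.

    A makeup session is held when `threshold` or more students were absent,
    and it is monitored only if the original session was monitored.
    """
    makeup_needed = absent_count >= threshold
    makeup_monitored = makeup_needed and original_monitored
    return makeup_needed, makeup_monitored
```

For example, a monitored session with five absentees calls for a monitored makeup session, while a session with three absentees calls for none.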



4.3.5 School and Student Participation 

Table 4-1 shows the results of the state coordinators’ efforts to gain the cooperation of 
the selected schools. Overall, almost 9,000 schools (4,921 for grade 4 and 3,798 for grade 8) 
participated in the 1992 Trial State Assessment. This is about 88 percent (unweighted) of 
the eligible schools in the original sample at each grade and about 95 percent (unweighted) of 
the sample after substitution. 

Table 4-1 

School Participation, 1992 Trial State Assessment 

Status                                                    Grade 4    Grade 8

Schools in original sample                                  5,356      4,118
Schools not eligible (e.g., closed, no grade 4/8)             152        128
Eligible schools in original sample                         5,204      3,990
Noncooperating (e.g., school, district, state refusal)        605        471
Participating                                               4,599      3,519
Substitutes provided for noncooperating schools               501        409
Participating substitutes                                     322        279
Total schools participating after substitution              4,921      3,798
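
The unweighted school participation rates quoted for Table 4-1 can be reproduced from its counts. The sketch below assumes that both rates use the eligible schools in the original sample as the denominator. The report does not spell out the denominator for the after-substitution rate, so that choice, along with the dictionary keys and function name, is an assumption.

```python
# Counts copied from Table 4-1.
TABLE_4_1 = {
    "grade 4": {"eligible": 5204, "participating": 4599, "after_substitution": 4921},
    "grade 8": {"eligible": 3990, "participating": 3519, "after_substitution": 3798},
}


def school_rates(counts):
    """Return (rate before substitution, rate after substitution) in percent."""
    before = 100 * counts["participating"] / counts["eligible"]
    after = 100 * counts["after_substitution"] / counts["eligible"]
    return round(before, 1), round(after, 1)


for grade, counts in TABLE_4_1.items():
    print(grade, school_rates(counts))
```

Under that assumption the rates come to roughly 88 percent before substitution and roughly 95 percent after substitution at both grades.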



Table 4-2 

Student Participation in the 1992 Trial State Assessment of Mathematics 

                              Grade 4        Grade 8
Status                        Mathematics    Mathematics

Sampled                           128,770        129,239
   Original sample                125,008        125,725
   Supplemental sample              3,762          3,514
Withdrawn                           5,545          6,087
Excluded                            6,424          6,532
To be assessed                    116,801        116,620
Assessed                          111,276        108,557
   Initial sessions               110,970        107,461
   Makeup sessions                    306          1,096

Participation results for students in the 1992 Trial State Assessment are given in Table 
4-2. Approximately 129,000 students were sampled in each grade. As can be seen from the 
table, the original sample, which was selected by the NAEP state supervisors, comprised about 
125,000 of this number. The original sample size was increased somewhat after the 
supplemental samples had been drawn (from students newly enrolled since the creation of the 
original lists). 

Assessment administrators removed some students from the total sample according to 
NAEP criteria: first, those students who had left their schools since the time that they were 
sampled (withdrawn); then, those judged incapable of participating meaningfully in the 
assessment by school staff (excluded). A student could be excluded if she or he either had an 
Individualized Education Plan (IEP) or was classified as Limited English Proficient (LEP), was 
incapable of participating meaningfully, and met certain other criteria. 

These exclusions left 116,801 fourth graders and 116,620 eighth graders to be assessed in 
mathematics. Of these, 111,276 fourth graders and 108,557 eighth graders were assessed, 
yielding unweighted student participation rates of 95.3 percent and 93.1 percent, respectively. 
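
The arithmetic behind these figures can be checked directly from the Table 4-2 counts: withdrawn and excluded students are subtracted from the sampled total, and the assessed count is divided by what remains. The dictionary layout and function name in this sketch are illustrative only.

```python
# Counts copied from Table 4-2 (mathematics).
TABLE_4_2 = {
    "grade 4": {"sampled": 128_770, "withdrawn": 5_545, "excluded": 6_424, "assessed": 111_276},
    "grade 8": {"sampled": 129_239, "withdrawn": 6_087, "excluded": 6_532, "assessed": 108_557},
}


def student_participation(counts):
    """Return (students to be assessed, unweighted participation rate in percent)."""
    to_be_assessed = counts["sampled"] - counts["withdrawn"] - counts["excluded"]
    rate = round(100 * counts["assessed"] / to_be_assessed, 1)
    return to_be_assessed, rate


for grade, counts in TABLE_4_2.items():
    print(grade, student_participation(counts))
```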



4.3.6 Results of the Observations 

During the assessment sessions, the quality control monitors were to note instances when 
the assessment administrators deviated from the prescribed procedures and whether any of 
these deviations were serious enough to warrant their intervention. Quality control monitors 
reported no serious breaches of the procedures or major problems 
that would call into question the validity of the assessment. 

Prescribed procedures were most often deviated from in the administrator’s reading of 
the script that introduced the assessment and provided the directions. Even so, in at least 90 
percent of the observed sessions the assessment administrator read the script verbatim or with 
only slight deviations. Examples of major deviations included skipping sections of the script, 
adding substantially to the script, and forgetting to pass out materials at the appropriate times. 
The quality control monitor intervened in these instances. 

Most of the other procedures that could have had some bearing on the validity of the 
results were adhered to very well by the assessment administrators. In 99 percent of the 
observed sessions, the assessment administrators opened the bundles of booklets at the 
appropriate time and handled questions from the students correctly. Ninety-nine percent of the 
fourth-grade and 98 percent of the eighth-grade sessions were timed correctly. 

In 95 percent of the observed mathematics sessions at both grades, the assessment 
administrator handled the distribution and collection of calculators without problems. In 95 
percent of the fourth-grade mathematics sessions and 97 percent of the eighth-grade 
mathematics sessions, the assessment administrator conducted the calculator training without 
problems. 

After the assessment session was over, assessment administrators were asked how they 
thought the assessment went and whether they had any comments or suggestions. Overall, 
assessment administrators stated that they thought 98 to 99 percent of the sessions went very 
well or satisfactorily. 

Assessment administrators reported that fewer of the fourth-grade mathematics sessions 
(74%) than of the eighth-grade sessions (86%) went very well. At the fourth grade, the 
percentage of sessions that assessment administrators thought went very well was slightly 
higher for monitored than for unmonitored sessions (78% compared to 70%); at the eighth 
grade, it was the same (86%) for both. 

Comments about the assessment materials and procedures were generally favorable. 
Criticisms or suggestions included that there were too many forms and too much paperwork; 
coding the booklet covers was tedious and problematic for students; and schools needed more 
information about NAEP and assessment results. 

In addition to these interviews, Westat sent a debriefing form to all of the NAEP state 
supervisors and met in person with half of them. This meeting produced suggestions for future 
assessments, especially many minor changes in the procedures, materials and training plans. In 
addition, the state supervisors recommended that district and particularly school staff receive 
more information describing the background and objectives of NAEP and the Trial State 
Assessments. They also stated that many school staff were very interested in results for their 
students, or at least summary results for their state. 

State coordinators were also sent a questionnaire about their experiences, suggestions, 
and comments. State coordinators from 39 of the participating states and territories responded. 
All of the 35 state coordinators responding to the question "How did the assessments go in your 
state?" said "Very well" to "Fairly well." They also commented favorably on the training 
package and other materials. Like the assessment administrators, the state coordinators 
criticized the amount of work required to prepare for the assessments. They made many other 
suggestions about the computerized data system, sampling procedures, training program, and 
design of the assessment. All of these suggestions will be reviewed as future assessments are 
planned. 

The results of the assessment and comments from assessment administrators and state 
coordinators were summarized in a report presented to the NAEP Network on May 11, 1992. 

In mid-August, each participating state and territory received a summary of its participation 
data, data collection activities, results of the assessment, and assessment administrators’ 
comments. 

Chapter 5 

PROCESSING ASSESSMENT MATERIALS 



Dianne Smrdel, Linda Reynolds, and Brad Thayer 
National Computer Systems 



5.1 OVERVIEW 

This chapter describes the printing, distribution, receipt, processing and final disposition 
of materials for the mathematics portion of the Trial State Assessment. The scope of the effort 
required by National Computer Systems (NCS) to process the materials is evidenced by the 
following: 

• Prior to the assessment, 13,448 bundles of assessment booklets at grade 4 and 
13,070 bundles at grade 8 were created and distributed to approximately 9,000 
schools. 

• For the approximately 111,000 students assessed for grade 4, about 222,000 
assessment booklets and 35,800 questionnaires were received and processed; and 
about 2,000,000 student responses from 59 constructed-response items were 
professionally scored. 

• For the approximately 109,000 students assessed for grade 8, about 218,000 
assessment booklets and 22,500 questionnaires were received and processed; and 
about 2,050,000 student responses from 65 constructed-response items were 
professionally scored. 

• In all, approximately 7 million double-sided pages from test booklets and 
questionnaires were optically scanned. 

Throughout the processing, the NCS Process Control System and Workflow Management 
System were used to track, audit, edit, and resolve characters of information. A quality control 
sample of characters of transcribed data was selected and compared to the actual responses in 
the assessment booklets. 

The volume of collected data and the complexity of the Trial State Assessment 
processing design, with its spiraled distribution of booklets, as well as the concurrent 
administration of this assessment and the national assessments, required the enhancement and 
implementation of flexible, innovatively designed processing programs and a sophisticated 
Process Control System. This system, developed for the 1990 assessments, allowed an 
integration of data entry and workflow management systems, including carefully planned and 
delineated editing, quality control, and auditing procedures. 

The magnitude of the effort is apparent when considering that the activities described in 
this chapter were completed concurrently with the processing of the national assessments, that 
all processing activities were completed within 10 weeks, and that an estimated accuracy rate of 
fewer than five errors for every 10,000 characters of information was achieved. 

Several major changes in materials processing were made from 1990, including the 
conversion of all documents to scannable form, the tailoring of shipments to the individual size 
and requirements of schools, and the reorganization of the process flow to conduct constructed- 
response scoring after all machine scoring and data verification processes were complete, 
allowing NCS to provide Westat and ETS with demographic and cognitive data at an earlier 
date. 



5.2 PROCESS CONTROL SYSTEM 

NCS maintains a Process Control System consisting of numerous specialized programs 
and processes to accommodate the unique demands of concurrent assessment processing and a 
unified ETS/NCS system integration. The Process Control System, which was developed for the 
1990 assessment, was necessary for maintaining control of all shipments of materials to the field, 
of all receipts from the field, and of any work in progress. The system is a unique combination 
of several reporting systems currently in use at NCS, along with some application-specific 
processes. These systems are the Workflow Management System, the Bundle Assembly Quality 
Control System, the Outbound Mail Management System, and the On-line Inventory Control 
System. Data were collected from these systems and recorded in the file called the "NAEP 
Process Control System." Additional information was directly entered into the Process Control 
System. 



5.3 WORKFLOW MANAGEMENT SYSTEM 

The functions of the Workflow Management System are to keep track of where the 
production work is and where it should be and to collect data for status reporting, forecasting, 
and other ancillary subsystems. The primary purpose of the Workflow Management System is 
to analyze the current workload by project across all work stations. 

The data processing and control systems are determined to a large extent by the type of 
documents processed. For the Trial State Assessment, only machine-scannable assessment 
booklets and answer documents were used to collect student responses. The five questionnaires 
that were used to collect data about school characteristics, teachers associated with sampled 
students, and students excluded from the assessment were also scannable documents. 


5.4 PROCESS FLOW OF NAEP MATERIALS AND DATABASE CREATION 

Figure 5-1 shows the conceptual framework of processes that were used both for the 
Trial State Assessment materials and for the national NAEP materials. 

Section I of Figure 5-1 depicts the flow of NAEP's printed materials. Information from 
the Administration Schedule and Packing List was used to control the processing of materials. 
The figure follows the path of each assessment instrument (Student Test Booklets, School 
Characteristics and Policies Questionnaires, Teacher Questionnaires, Excluded Student 
Questionnaires, Packing List, and Administration Schedules) as it was tracked through the 
appropriate processes that resulted in the final integrated NAEP database. 

The remainder of this chapter provides an overview of the materials processing activities 
as shown in Section I of Figure 5-1 and detailed in Figure 5-2. Section II of Figure 5-1 depicts 
the evolution of the NAEP/NCS database from the transcribed data to the final files, provided 
to Westat for creation of weights and to ETS for analysis and reporting. 

The 1992 NAEP data collection resulted in six classes of data files (student, school, 
teacher, excluded student, sampling weight, and item information files). The structure and 
internal data format of the 1992 NAEP database were a continuation of the integrated design 
originally developed by ETS in 1983. 



5.5 MATERIALS DISTRIBUTION 

The use of bar code technology in document control was introduced to NAEP by NCS in 
the 1990 assessment; its use continued in 1992. Bar codes were applied to the front cover of the 
documents. The bar code consisted of the two-digit booklet number, a five-digit sequential 
number, and a check digit. It was unnecessary to pre-identify the estimation booklets with bar 
codes because students were instructed to grid the estimation booklet cover with the 
identification number of their original booklet. 
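
As an illustration of the eight-digit layout described above (two-digit booklet number, five-digit sequential number, one check digit), the following sketch builds and validates such an identifier. The report does not say how the check digit was computed, so the alternating 3/1 weighting modulo 10 used here is purely an assumed scheme, and all function names are hypothetical.

```python
def check_digit(payload: str) -> int:
    """Assumed scheme: alternating weights 3 and 1, modulo 10."""
    total = sum(int(d) * (3 if i % 2 == 0 else 1) for i, d in enumerate(payload))
    return (10 - total % 10) % 10


def make_booklet_id(booklet_number: int, sequence: int) -> str:
    """Compose booklet number (2 digits), sequence (5 digits), and check digit."""
    payload = f"{booklet_number:02d}{sequence:05d}"
    return payload + str(check_digit(payload))


def parse_booklet_id(code: str):
    """Validate an 8-digit identifier and return (booklet_number, sequence)."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected exactly 8 digits")
    payload, digit = code[:7], int(code[7])
    if check_digit(payload) != digit:
        raise ValueError("check digit mismatch")
    return int(payload[:2]), int(payload[2:])
```

Whatever scheme was actually used, the purpose of the final digit is the same: a misread or mis-keyed identifier fails validation instead of silently pointing at another booklet.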

The booklets were spiraled into 26 unique bundles consisting of 11 booklets in a set 
pattern. A header sheet was attached to each bundle that indicated the assessment type, bundle 
type, bundle number, and a list of the booklet types to be included in the bundle. 

The bundle numbers on the header sheet were created to identify the type of bundle. 
All bundles were then passed under a scanner programmed to interpret this type of bar code, 
and the file of scanned bar codes was transferred from the scanner to the mainframe. A 
computer program compared the bundle type expected to the one actually scanned after the 
header and verified that there were 11 booklets in each bundle. Any discrepancies were printed 
on an error listing forwarded to the Packaging Department, where the error was corrected and 
the bundle was again read into the system for another quality control check. This process was 
repeated until all bundles were correct. 
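
The quality control check described above can be sketched as follows. The mapping from bundle type to its expected booklet sequence is a hypothetical stand-in for the spiral patterns, which this chapter does not list, and the function name and messages are illustrative.

```python
BOOKLETS_PER_BUNDLE = 11


def check_bundle(bundle_type, scanned_booklets, expected_patterns):
    """Return discrepancy messages for one scanned bundle (empty list if correct).

    `expected_patterns` maps each bundle type to the booklet sequence it
    should contain; it is a hypothetical lookup table for this sketch.
    """
    expected = expected_patterns.get(bundle_type)
    if expected is None:
        return [f"bundle type {bundle_type}: not a known bundle type"]
    errors = []
    if len(scanned_booklets) != BOOKLETS_PER_BUNDLE:
        errors.append(
            f"bundle type {bundle_type}: {len(scanned_booklets)} booklets scanned, "
            f"expected {BOOKLETS_PER_BUNDLE}"
        )
    if list(scanned_booklets) != list(expected):
        errors.append(f"bundle type {bundle_type}: booklet sequence differs from pattern")
    return errors
```

A bundle with a non-empty result would go on the error listing, be corrected, and be scanned again until the function returns an empty list.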

The bundles were shrink-wrapped in clear plastic. The estimation booklets were also 
shrink-wrapped in groups of 11. A bundle of pre-identified booklets and a bundle of estimation 
booklets were strapped together. A bright label was placed over the cross of the straps that 
read "Do Not Open Until 45 Minutes Before Testing." Following this, bundles were ready for 
assignment and distribution. 

Figure 5-1 

Data Flow Overview, 1992 Trial State Assessment 

When packing lists for distribution of materials were created from the Materials 
Distribution System, a second and more detailed bundle slip was produced. This bundle slip 
indicated the same information as the slip wrapped with the bundle, in addition to the school 
number and the complete booklet ID numbers of the booklets within that bundle. This allowed 
the assessment administrators to pre-assign booklets for their sessions. 

The timing of the shipments of these materials to the participating schools was critical, 
since the shipments needed to be in the school at least one week but not more than two weeks 
prior to testing. As in 1990, calculators were in limited supply. Therefore, shipments for 
assessments occurring during the last week could not be completed until shipments from the first 
week's assessments were returned. This affected the shipments for both grade 4 and grade 8. 
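
The delivery constraint (in the school at least one week, but not more than two weeks, before testing) defines a window that can be computed from the assessment date. This is a minimal sketch; the function names are illustrative and it treats the window as inclusive at both ends, which the report does not specify.

```python
from datetime import date, timedelta


def delivery_window(assessment_date):
    """Return (earliest, latest) acceptable arrival dates for a shipment."""
    earliest = assessment_date - timedelta(weeks=2)
    latest = assessment_date - timedelta(weeks=1)
    return earliest, latest


def arrives_on_time(arrival, assessment_date):
    """True if the arrival date falls inside the delivery window (inclusive)."""
    earliest, latest = delivery_window(assessment_date)
    return earliest <= arrival <= latest
```

For an assessment on February 3, 1992, the window runs from January 20 to January 27, 1992.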

Each school conducted at least one session; some conducted more than one. The 
materials needed for a school to conduct all of its mathematics sessions were sent in one 
shipment. The booklets for the fourth-grade reading session(s) were boxed separately in the 
same shipment. In 1990, each session’s materials had been shipped independently. Although 
this change in shipment practice eliminated the option to pre-assemble many materials, it did 
cause less confusion within the schools. 



Some materials were distributed per school; others were distributed per session. 
Materials issued per session were: 

     Bundle(s) of 11 assessment booklets (based on sample count) 
     6 Scientific calculators (grade 8) per bundle of booklets 
     6 Simple calculators (grade 4) per bundle of booklets 
     15 Protractors 
     1 Cassette tape for estimation booklet 
     1 Digital timer 
     1 Calculator poster 
     1 Mathematics poster 
     1 Tape recorder with batteries 
     5 Rulers 
     5 Sets of geometric shapes 
     1 Pad of appointment cards 
     1 Return postage paid label 
     1 Post-it note pad 
     1 Shipping tape 

Those materials distributed by school were: 

     2 Roster of Questionnaires 
     2 Assessment Notifications 
     1 Pre-addressed envelope 
     5 Excluded Student Questionnaires 
     5 Teacher Questionnaires 
     1 School Characteristics and Policies Questionnaire 
     1 Pre-addressed box 
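
Because booklets were shipped in bundles of 11 and calculators were issued six per bundle, the per-session quantities follow from the session's sample count. The sketch below assumes the number of bundles is the sample count divided by 11 and rounded up; the report says only that bundles were "based on sample count", so the rounding rule and the function name are assumptions.

```python
import math

BOOKLETS_PER_BUNDLE = 11
CALCULATORS_PER_BUNDLE = 6


def session_quantities(sample_count):
    """Estimate bundles, booklets, and calculators for one session."""
    # Assumed rule: enough whole bundles to cover every sampled student.
    bundles = math.ceil(sample_count / BOOKLETS_PER_BUNDLE)
    return {
        "bundles": bundles,
        "booklets": bundles * BOOKLETS_PER_BUNDLE,
        "calculators": bundles * CALCULATORS_PER_BUNDLE,
    }
```

Under this assumption a typical 30-student session would receive 3 bundles, that is, 33 booklets and 18 calculators.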



Shipments were sent according to the week of assessment. Some schools found they 
needed extra quantities of materials (i.e., more excluded student questionnaires or more teacher 
questionnaires) and calls were received requesting these additional materials. 



Aiding in the security of the shipments was the decision to send all shipments, whenever 
possible, through Airborne. NCS is connected to the Airborne system through a computer link, 
thus expediting tracing of any misdirected shipments. This system provides the date and time of 
delivery as well as the name of the person who signed for the shipment. All shipments were 
recorded in the Airborne Libra system. If a shipment had to be sent by UPS or the U.S. Postal 
Service, this information was also recorded and transferred to the mainframe. 



5.6 PROCESSING ASSESSMENT MATERIAL 

The materials from each session were to be returned to NCS in the same box in which 
they were originally mailed. It was the responsibility of the assessment administrator in the 
unmonitored schools and the quality control monitor in the monitored schools to repackage the 
items in the proper order, complete all paperwork and return the shipment through the U.S. 
Postal Service, using the postage-paid label provided. 

With approximately 9,000 individual shipments arriving over a four-week period, it was 
necessary to devise a system that would quickly acknowledge receipt of a school’s material. A 
label applied to the outside of the box by the NCS packaging department contained a bar code 
which indicated the school number and the project number. When the shipment arrived at NCS, 
the bar code was read and the shipment forwarded to the receiving area. The file was then 
transferred to the mainframe through a PC link and a computer program was used to apply the 
shipment receipt date to the appropriate school within the Process Control System. This 
provided current status of shipments received regardless of any processing backlog. This 
information was then transferred electronically to Westat. The status of the administration was 
checked and in some cases a trace was initiated on the shipment. 

Receiving personnel also checked the shipment to verify that the contents of the box 
matched the school and session indicated on the label. Each shipment was checked for 
completeness and accuracy, regardless of whether it was monitored or unmonitored. 

The materials were checked against the Packing List (see Figure 5-3) to verify that all 
materials were returned. If any discrepancies were found, an alert was issued. If all assessment 
instruments were returned, processing continued. Quantities of scientific calculators were in 
short supply; therefore, during the first two weeks of the assessment, calculators were taken 
from the incoming shipments and returned to the packaging area to be included in other 
shipments for the last weeks of testing. 

Each booklet and Excluded Student Questionnaire was verified against the 
Administration Schedule. This included verification of all counts of booklets returned and the 
matching of information on the front cover of the booklets to that on the Administration 
Schedule. If any discrepancy was discovered, an alert was issued. The same verification was 
followed to assure that one estimation booklet was received for every student assessed, and that 
the correct booklet number was gridded on the front cover to assure matching. 

After the contents of the shipment had been identified and verified, the information 
from the Administration Schedule was entered into the Process Control System. That 
information included school number, session code, counts of the number of students in original 
sample, supplemental sample, total sample, withdrawn, excluded, to be assessed, absent, original 
assessed, assessed in makeup and total assessed. If a makeup session was expected, an 
information alert was issued to facilitate tracking. The control counts were used by NCS for 
verification of processing counts. This information was also transferred electronically to Westat 
on a weekly basis to be used to produce participation statistics for the states. 

If quantities and individual information matched, the booklets were organized into work 
units and batched for processing. The processing flow was changed in 1992, resulting in the 
completion of the machine scoring prior to the constructed-response scoring. Each batch, 
consisting of multiple sessions, was assigned a unique batch number. The batch number was 
entered on the Workflow Management System, facilitating the internal tracking of the session 
and allowing departmental resource planning. A scannable session header, included in the 
shipment from the school, was coded with the session code and placed on top of the stack of 
documents. All student documents were forwarded to machine scanning functions. Control 
documents were forwarded to appropriate record filing systems. 

The estimation block was administered in a separate booklet (Book M29T). The cover 
of Book M29T contained a section in which students recorded the booklet ID of their assigned 
assessment booklet. The openers verified that the corresponding booklet ID was correctly 
recorded on the cover and that an estimation booklet had been issued to each assessed student. 
As Book M29T contained no constructed responses, these booklets were batched separately and 
forwarded to the scanning area. 

The excluded student questionnaires and teacher questionnaires were compared to the 
Roster of Questionnaires and the Administration Schedule to verify demographic information. 
Some questionnaires may not have been available for return with the shipment. These were 
returned to NCS at a later date in an envelope provided for that purpose. If the Excluded 
Student Questionnaire was not returned with the shipment of booklets, a record containing all 
demographic information on that student from the Administration Schedule was entered into 
the Process Control System. If the questionnaire was subsequently returned, this record was 
deleted. Otherwise, the record was provided to Westat for use in the weighting process. 

Each school characteristics and policies questionnaire was compared with the Roster of 
Questionnaires and the school number was verified to match all other materials in the shipment. 
As with the other questionnaires, this document may not have been returned with the shipment 
and could also be returned in the supplemental envelope. There was no additional effort made 
to collect or report information on unreturned school questionnaires. 

All assessed and absent students were assigned a test booklet. To indicate an absence, 
the "A" bubble in the Administration Code column on the front cover of the booklet was 
gridded. The booklet was then processed with assessed student booklets to maintain session 
integrity. 

The Packing List (Figure 5-3) was used by the schools to account for all materials 
received from and returned to NCS. Any discrepancies in quantities received or returned to 
NCS were indicated. Also indicated was whether a makeup session was to be held, the date of 
scheduled makeup, the number of students involved, and the quantities of materials being held 
for later return. 
Figure 5-3 

Packing List, 1992 Trial State Assessment 

PLEASE RETURN ALL UNUSED MATERIALS 





The Administration Schedule contained the demographic characteristics of the students 
selected for the assessment. This information included the sex, race/ethnicity, birth date, and 
IEP/LEP indicators. The booklet number of the student selected was recorded on the 
Administration Schedule during the assessment process, and the demographic information was 
transferred to the booklet covers by either the student or the assessment administrator. 

The demographics of the sampled students who did not participate in the assessment 
(exclusions and absentees) were provided to Westat to be used to adjust the sampling weights of 
the students who did participate. The excluded student information was obtained from the 
excluded student questionnaire or provided on a file for those not returned to NCS. The absent 
student information was taken from the front cover of the booklet that was assigned prior to the 
start of the assessment. This procedure eliminated the need for an additional form for absent 
students. 

For the Rosters of Questionnaires, two numbers were entered for each type of 
questionnaire: number of questionnaires expected and number actually received. The Packing 
List, Administration Schedule, and Roster of Questionnaires were forwarded to the operations 
coordinator and filed by school within state for future reference. 



5.7 PROFESSIONAL SCORING 

The 1992 Trial State Assessment in mathematics contained three different types of 
cognitive items: extended constructed-response, short constructed-response, and multiple-choice. 
These items were administered in scannable assessment booklets that were identical to those 
used for the fourth- and eighth-grade national assessments. 

Scores for the constructed-response items were gridded by the readers on separate, 
scannable scoring sheets, one sheet per booklet. As batches of test booklets cleared the editing 
process, scoring sheets for each batch of booklets were automatically generated by the system. 
Since the system had already captured all scannable information from each test booklet, scoring 
sheets could be generated for only those student booklets for which the student was present and 
eligible for the assessment. At the same time that the full set of scoring sheets was generated, 
a 20 percent (minimum) subset of booklets was selected at random by the system for reliability 
scoring. A separate set of scoring sheets was generated for these booklets. 
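
The random selection of a minimum 20 percent reliability subset described above can be sketched as follows. This is only an illustration: the function name, the booklet ID format, and the use of a seeded generator are assumptions, and the report does not describe how the system actually drew the sample.

```python
import random


def select_reliability_sample(booklet_ids, min_percent=20, seed=None):
    """Return a random subset holding at least min_percent of the booklets."""
    ids = list(booklet_ids)
    # Integer ceiling, so the subset never falls below the stated minimum.
    k = (len(ids) * min_percent + 99) // 100
    rng = random.Random(seed)
    return sorted(rng.sample(ids, k))


# Hypothetical batch of 54 booklets from present, eligible students.
batch = [f"B{n:04d}" for n in range(1, 55)]
reliability_set = select_reliability_sample(batch, seed=1992)
print(len(reliability_set), "of", len(batch), "booklets get a second scoring sheet")
```

Rounding up rather than down is what keeps the subset at or above the 20 percent floor for batch sizes that are not multiples of five.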

Once a batch of scoring sheets was matched with the corresponding batch of student 
assessment booklets, the booklets were forwarded to the professional scoring area. The scoring 
of the Trial State Assessment was conducted simultaneously with the scoring of the mathematics 
portion of the national program. The same readers scored the constructed-response items from 
both programs. 



5.7.1 Description of Scoring 

Each constructed-response item had a unique scoring guide that identified the range of 
possible scores for the item and defined the criteria to be used in evaluating the students’ 
responses. Team leaders reviewed discrepancies between readers and reviewed decisions 
regularly so that all readers scored each item similarly. 

The readers scoring the short constructed-response items were organized into eight 
teams, each comprising 11 readers and a team leader. These teams scored responses to 84 
discrete short constructed-response items at the fourth and eighth grades. Of these items, 27 
were categorized as right/wrong, while the remaining 57 items included several different 
categories of correct and incorrect responses. For the items scored as right/wrong, a correct 
response was scored as 8 and an incorrect response was scored as 1. Items with two correct 
responses were given a score of 7 for the second correct response. Various types of incorrect 
responses were also tracked with separate score points. The incorrect responses were assigned a 
score point from 1 to 5 to capture information on the specific types of errors students were 
making. 
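
As a rough illustration of the score codes described above for short constructed-response items, the sketch below maps a code to a correct or incorrect category and computes a percent correct. The function names are hypothetical, and the handling of any code outside 1 to 5, 7, and 8 is an assumption, since the text does not say how such codes were treated.

```python
CORRECT_CODES = {8, 7}          # 8 = correct; 7 = second correct response
ERROR_CODES = {1, 2, 3, 4, 5}   # separate score points for types of error


def classify_short_score(code):
    """Collapse a short constructed-response score code into a category."""
    if code in CORRECT_CODES:
        return "correct"
    if code in ERROR_CODES:
        return "incorrect"
    return "other"  # assumption: anything else is left for review


def percent_correct(codes):
    """Percent of classifiable responses that were scored correct."""
    scored = [c for c in codes if classify_short_score(c) != "other"]
    if not scored:
        return None
    right = sum(classify_short_score(c) == "correct" for c in scored)
    return 100.0 * right / len(scored)


example_codes = [8, 1, 7, 3, 8, 5]
print(percent_correct(example_codes))
```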

The readers scoring the extended constructed-response items were organized into three 
teams. One team comprised 11 readers and one team leader. This team scored responses to 
the 11 discrete extended constructed-response items from the fourth and eighth grades for both 
the national and the Trial State assessments. The other two teams comprised six readers and 
one team leader each. These two teams scored responses to 17 discrete extended 
constructed-response items from the fourth, eighth, and twelfth grades. (Figure 5-4 shows the text of an 
eighth-grade extended constructed-response item and its scoring guide.) The extended 
constructed-response items were scored on a rising scale of 1 to 4. Responses that were 
"off-task" or completely incorrect received a score of 9. In this way, information was captured 
about what parts of an item students were or were not able to complete correctly. 



5.7.2 Training 

The readers were trained to ensure that they would reliably score the 
constructed-response items. The training, which was conducted during a one-week period, familiarized the 
group with the scoring guides in order to reach a high level of agreement among the readers. 

Before the training program began, the team leaders worked with ETS mathematics test 
development staff to prepare training sets (sets of sample responses to accompany the scoring 
guides). Training involved explaining each item and its scoring guide to the readers and 
discussing responses that were representative of the various score points in the guide. The 
training was conducted by ETS mathematics test development specialists with assistance from 
the team leaders. Following the explanations, the readers scored and discussed 5 to 35 carefully 
selected "practice papers" for each item, depending on the complexity of the item. Next, each 
reader practiced by scoring all the constructed-response items in each of approximately 12 
bundles of booklets, with an average of 54 booklets per bundle. (It was not necessary to use this 
method of training for the items with straight numeric answers since determining the correctness 
of the student responses for these items was straightforward and fully explained within the 
scoring guide.) During this practice, discussion sessions were held to review responses that 
received a wide range of scores. 
Figure 5-4 

Sample Extended Constructed-response Item and Scoring Guide 

This question requires you to show your work and explain your reasoning. You may use drawings, words, and numbers 
in your explanation. Your answer should be clear enough so that another person could read it and understand your 
thinking. It is important that you show all your work. 

Treena won a 7-day scholarship worth $1,000 to the Pro Shot Basketball Camp. Round trip travel expenses to the camp 
are $335 by air or $125 by train. At the camp she must choose between a week of individual instruction at $60 per day 
or a week of group instruction at $40 per day. Treena’s food and other expenses are fixed at $45 per day. If she does 
not plan to spend any money other than the scholarship, what are all choices of travel and instruction plans that she 
could afford to make? Explain your reasoning. 



Solution: 

Treena’s fixed expenses will be 7 x 45 = $315 for the 7 days. Therefore, she has 1000 - 315 = 685 to spend for 
instruction and travel. The group plan will cost 7 x 40 = 280 while the individual plan will cost 7 x 60 = 420. Treena 
has three options: 

Group and Train:        280 + 125 = 405 (720), $280 left 
Group and Plane:        280 + 335 = 615 (930), $70 left 
Individual and Train:   420 + 125 = 545 (860), $140 left 

She cannot choose the individual plan and travel by plane because her total expenses would be $1,070, which is greater 
than the allotted scholarship. (This can be considered as a valid conclusion but can only be counted in a score of 1 or 
2.) Any full credit response clearly communicates that Treena has three options, what the three options are, and how the 
student arrived at the three options. 



1  a) Student indicates valid conclusions with no mathematical evidence 
      OR 
      starts some correct mathematics beyond computing fixed cost (7 x 45 = 315) but indicates no conclusion. 
   b) Student work contains major mathematical errors or flaws in reasoning. For example: The student does not 
      consider Treena’s fixed expenses or does not realize that 40 and 60 must each be multiplied by 7. 

2  a) Student indicates 1 or more correct conclusions; additional supporting computations beyond level 1 must be 
      present. The work may contain some computational errors. 
   b) Student has correct mathematics for 1 or more options but indicates no conclusion. 

3  a) Student shows correct mathematical evidence that Treena has 3 options, but the explanation is unclear or 
      incomplete. 
   b) Student shows correct mathematical evidence for any 2 of Treena’s 3 options and the explanation is clear and 
      complete. 

4  Full credit response - correct solution and complete, clear explanation 

9  The work is completely incorrect, irrelevant, or off task. (Just computing 7 x 45 = 315 is a score of 9.) 

0  No response (blank) 
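
The arithmetic in the Figure 5-4 solution can be checked by enumerating every combination of instruction plan and travel mode. The sketch below is only an illustration of that check; the constant and function names are invented here, while the dollar amounts are those given in the item.

```python
from itertools import product

SCHOLARSHIP = 1000
DAYS = 7
FIXED_PER_DAY = 45                                   # food and other expenses
TRAVEL = {"plane": 335, "train": 125}                # round trip fares
INSTRUCTION_PER_DAY = {"individual": 60, "group": 40}


def affordable_plans():
    """List (instruction, travel, total cost, money left) for plans within budget."""
    fixed = DAYS * FIXED_PER_DAY
    plans = []
    for (plan, rate), (mode, fare) in product(INSTRUCTION_PER_DAY.items(), TRAVEL.items()):
        total = fixed + DAYS * rate + fare
        if total <= SCHOLARSHIP:
            plans.append((plan, mode, total, SCHOLARSHIP - total))
    return plans


for plan, mode, total, left in affordable_plans():
    print(f"{plan} and {mode}: total ${total}, ${left} left")
```

Only the individual plan with air travel is excluded, since it totals $1,070.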
Once the practice session was completed, the formal scoring process began. During the 
scoring, notes on various items were compiled for the readers for their reference and guidance. 
In addition, short training sessions were conducted when the team leaders determined by 
reviewing discrepancies that certain items were causing difficulties for the scorers. 

During the first week of scoring, the team leaders reviewed 25 percent of the responses 
scored by each reader and brought any problems related to scoring to the attention of the 
individual reader. After the first week, leaders continued to review 25 percent of the booklets 
for any readers found to be having difficulties and 10 percent of the booklets scored by the rest 
of the readers. In this way, team leaders could be certain that their teams were scoring 
consistently. When a reader’s score was judged to be discrepant with the scoring guides, the 
team leader discussed the response and its score with that reader. The team leaders also met as 
a group on a daily basis to discuss any problem responses to test items that had arisen in order 
to ensure that all teams of readers were scoring all items in exactly the same manner. 



5.7.3 Trend Scoring of 1990 Items 

During the scoring of the 1992 Trial State Assessment in mathematics, a trend scoring 
was conducted using a subsample of student test booklets from the mathematics portion of the 
1990 NAEP. Four blocks of items used in the 1990 assessment were re-administered in 1992. 
Three of these blocks contained open-ended items (a total of 25 short constructed-response 
items). The training for the scoring of these items was conducted using training materials and 
scoring guides identical to those used for the 1990 assessment. 

One hundred booklets from each of the 40 states that participated in the 1990 Trial State 
Assessment were chosen at random to be scored again in 1992. Each of these 4,000 booklets 
contained all three trend blocks. Scoring reliability for the 1990 trend scoring was 95.8 percent. 
This reliability percentage was calculated based on the rate of exact agreement between the 
scores given in 1990 and the scores given to the same student responses in 1992. 



5.7.4 Reliability of Scoring 

Twenty percent of the booklets containing constructed responses (for a total of over 
6,500 responses) were scored by a second reader to obtain statistics on interreader reliability, 
which was determined by calculating the percent of exact agreement between readers. The 
overall interreader reliability for all constructed-response items combined was 94.1 percent. 
Reliabilities for the 11 extended constructed-response items ranged from 69.3 percent to 90.5 
percent, with an average reliability of 81.1 percent (reliabilities for each of the 11 items are 
given in Table 5-1). For the 82 short constructed-response items, reliabilities ranged from 81.9 
to 99.2 percent, with an average reliability of 96.2 percent. This reliability information was also 
used by the team leaders in monitoring the capabilities of all readers and the uniformity of 
scoring across readers. 
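
The reliability statistic used here, the percent of exact agreement between two readers, can be written out as a short function. This is a minimal sketch with made-up scores; the function name and the sample data are illustrative and are not taken from the assessment.

```python
def percent_exact_agreement(first_scores, second_scores):
    """Percent of responses on which two readers gave exactly the same score."""
    if len(first_scores) != len(second_scores):
        raise ValueError("both readers must score the same responses")
    if not first_scores:
        raise ValueError("no responses to compare")
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return 100.0 * matches / len(first_scores)


# Hypothetical scores for ten responses to one extended item.
first_reader = [4, 3, 9, 2, 1, 4, 3, 2, 0, 4]
second_reader = [4, 3, 9, 2, 2, 4, 3, 2, 0, 3]
print(percent_exact_agreement(first_reader, second_reader))
```

Exact agreement counts a difference of one score point the same as any larger difference, which is one reason extended items with more score points tend to show lower values than right/wrong items.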

Because the reliability scoring was done on separate scoring sheets, all reliability scoring 
was "blind," or uninfluenced by any score already given. The reliability scoring for each batch of 
booklets was completed by a team other than the one that did the full scoring. In this way, the 
full scoring and the reliability scoring were spread across all teams so that all readers were 
compared against all other readers. 

Table 5-1 

Interreader Reliabilities for Extended Constructed-response Items 
in the 1992 Trial State Assessment in Mathematics 

Grade 4 

NAEP ID   Description                                                          Content Area                              Interreader Reliability 
M041201   Compare two geometric shapes                                         Geometry                                  72.6 
M043501   Explain solution to a problem involving counting                     Algebra and Functions                     89.5 
M044401   Demonstrate understanding of place value ("Laura Use Calculator")    Numbers & Operations                      90.5 
M045401   Reason (meaning of fraction) ("Pizza Comparison")                    Numbers & Operations                      82.1 
M049001   Identify correct pictograph ("Graphs of Pockets")                    Data Analysis, Statistics, & Probability  76.8 

Grade 8 

NAEP ID   Description                                                          Content Area                              Interreader Reliability 
M045901   Solve a problem involving intersecting circles ("Radio Stations")    Geometry                                  79.3 
M051101   Reason to maximize difference                                        Numbers & Operations                      69.3 
M052201   Show how three figures can be divided to find area                   Measurement                               84.3 
M053101   Find probability and explain                                         Data Analysis, Statistics, & Probability  85.1 
M054301   Extend pattern to find term ("Marcy Dot Pattern")                    Algebra & Functions                       81.0 
M055501   Plan and analyze expenses with given parameters ("Treena’s Budget")  Numbers & Operations                      81.3 



5.8 DATA TRANSCRIPTION SYSTEMS 

The transcription of the student response data into machine-readable form was achieved 
through the use of three separate systems: data entry (scanning), validation (pre-edit), and 
resolution. 



5.8.1 Data Entry 

The data entry process was the first point at which booklet-level data were entered into the 
computer system. As all documents used in the 1992 assessment were scannable documents, the 
data were collected using NCS optical scanning equipment. The data were then edited and 
questionable data were resolved before further processing. 

To ensure data integrity, edit rules were applied to each scanned data field. This 
procedure validated each field and reported all problems for subsequent resolution. After each 
field was examined and corrected, the edit rules were re-applied for final verification. 



5.8.2 Scanning 

After the initial manual verification, the scannable documents were transported to a 
slitting area where the folded and stapled spine was removed from each document. Scanning 
operations were performed by NCS’s HPS Optical Scanning equipment. The optical scanning 
devices and software used at NCS permit a complete mix of NAEP scannable materials to be 
scanned with no special grouping requirements. However, for manageability and tracking 
purposes, student documents, excluded student questionnaires and teacher questionnaires were 
batched separately. In addition to the capture of scannable responses, the bar code 
identification numbers used to maintain process control were also decoded and transcribed to 
the NAEP computerized data file. 

The scanning program is a table-driven software process that uses standard routines and 
application-specific tables to identify and define the documents and formats to be processed. 
When a booklet cover is scanned, the program uses the booklet number to determine the 
sequence of pages and the formats to be processed. By reading the booklet cover, the program 
recognizes which pages should follow and in what order. 
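
The table-driven idea can be pictured as a lookup from booklet number to an ordered list of page formats, with the scanned pages checked against that list. The sketch below is purely illustrative: the table contents, format names, and function names are invented, and nothing here reflects the actual NCS tables or software.

```python
# Hypothetical application table: booklet number -> ordered page formats.
BOOKLET_TABLE = {
    "01": ["cover", "background", "math_block_A", "math_block_B"],
    "02": ["cover", "background", "math_block_B", "math_block_C"],
}


def expected_pages(booklet_number):
    """Look up the page sequence that should follow a given booklet cover."""
    try:
        return BOOKLET_TABLE[booklet_number]
    except KeyError:
        raise ValueError(f"no table entry for booklet {booklet_number!r}") from None


def first_sequence_error(booklet_number, scanned_formats):
    """Return the index of the first page out of sequence, or None if all match."""
    expected = expected_pages(booklet_number)
    for i, fmt in enumerate(expected):
        if i >= len(scanned_formats) or scanned_formats[i] != fmt:
            return i
    # Any pages beyond the expected sequence are reported at the first extra index.
    return None if len(scanned_formats) == len(expected) else len(expected)


print(first_sequence_error("02", ["cover", "background", "math_block_C"]))
```

Keeping the page sequences in a table means a new booklet form needs a new table entry rather than new program logic.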

The scanning program wrote four types of data records into the data set: a batch header 
record containing information coded onto the batch header sheet by receipt processing staff; a 
session header record containing information coded onto the session batch header sheet by 
receipt processing staff; a data record containing all of the translated marked ovals from all 
pages in a booklet; and a dummy data record, serving as a place holder in the file for a booklet 
with an unreadable cover sheet. The document code was written in the same location on all 
records to distinguish them by type. 

The following coding rules were used: 

• The data values from the booklet covers and scorer identification fields were 
coded as numeric data. 

• Unmarked fields were coded as blanks and processing staff were alerted to 
missing or uncoded critical data. 

• Fields that had multiple marks were coded as asterisks (*). 

• The data values for the item responses and scores were returned as numeric 
codes. 

• The multiple-choice, single-response format items were assigned codes depending 
on the position of the response alternative; that is, the first choice was assigned a 
1, the second a 2, and so forth. 

• The circle-all-that-apply items were given as many data fields as response 
alternatives; the marked choices were coded as 1 and the unmarked choices as 
blanks. 

• The fields from unreadable pages were coded with an X as a flag for resolution 
staff to correct. 
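
The coding rules listed above can be expressed as a few small functions. This is a sketch under stated assumptions: the marks for an item are represented as a list of booleans in printed order, each coded field is a single character, and the function names are hypothetical.

```python
def code_single_response(marks):
    """Code a multiple-choice item: position of the mark, blank, or asterisk."""
    marked = [position for position, is_marked in enumerate(marks, start=1) if is_marked]
    if not marked:
        return " "            # unmarked field coded as blank
    if len(marked) > 1:
        return "*"            # multiple marks coded as an asterisk
    return str(marked[0])     # first choice is 1, second is 2, and so forth


def code_circle_all(marks):
    """Code a circle-all-that-apply item: one field per response alternative."""
    return ["1" if is_marked else " " for is_marked in marks]


def code_unreadable_page(field_count):
    """Flag every field on an unreadable page with X for resolution staff."""
    return ["X"] * field_count


print(repr(code_single_response([False, True, False, False])))
print(code_circle_all([True, False, True]))
```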



5.9 DATA VALIDATION 

The data entry and resolution system used for the Trial State Assessment Program was 
also used for the national assessment program. The system is able to process materials 
submitted from both scannable and nonscannable media simultaneously for three age groups, 
three assessment types, and five questionnaires. The use of batch identification codes 
(comprising the school and session codes as well as the batch sequence numbers for 
suspect record identification) facilitated the management of the system and correction of 
incorrectly gridded or keyed information. 

As the program processed each data record, it first read the booklet number and 
checked it against the batch session code for appropriate session type. Any mismatch was 
recorded on the error log and processing continued. The booklet number was compared against 
the first two digits of the student identification number. If they disagreed, because of improper 
bar coding, a message was written to the error log. The remaining booklet cover fields were 
then read and validated for the correct range of values. The school codes had to be identical to 
those on the Process Control System record and the grade code had to be either 4 or 8. All 
data values that were out of range were read as is, but flagged as suspect. All data fields that 
were read as asterisks were recorded on the edit log. 
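
Several of the cover checks described above can be sketched as a single validation function that never stops processing and only accumulates log messages. The field names, the dictionary layout, and the two-character booklet number are assumptions made for illustration; the check of booklet number against session type is omitted because the text does not give that mapping.

```python
def validate_cover(cover, pcs_school_code):
    """Return log messages for a booklet cover; values are kept as read."""
    messages = []
    # Booklet number must agree with the first two digits of the student ID.
    if cover["student_id"][:2] != cover["booklet_number"]:
        messages.append("booklet number disagrees with student ID")
    # School code must be identical to the Process Control System record.
    if cover["school_code"] != pcs_school_code:
        messages.append("school code does not match Process Control System")
    # Grade code must be 4 or 8; anything else is flagged as suspect.
    if cover["grade"] not in ("4", "8"):
        messages.append("grade code out of range")
    # Fields read as asterisks (multiple marks) go on the edit log.
    for name, value in cover.items():
        if "*" in value:
            messages.append(f"multiple marks in field {name}")
    return messages


good = {"booklet_number": "12", "student_id": "1200345", "school_code": "4417", "grade": "8"}
bad = {"booklet_number": "12", "student_id": "1300345", "school_code": "4417", "grade": "*"}
print(validate_cover(good, "4417"))
print(validate_cover(bad, "4418"))
```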







Document definition files described each document as a series of blocks, each of which was 
described as a series of items. The blocks in a document were traversed in the order in which they 
appeared on the document. Each block’s fields were validated during this process. If a document 
contained suspect fields, the cover information was recorded on the edit log with a description 
of the suspect data. Some fields (e.g., AGE or DOB) required special types of edits. These 
fields were identified in the document definition files, and a subroutine was invoked to handle 
these cases. 

The program next cycled through the data area corresponding to the item blocks. The 
task of translating, validating, and reporting errors for each data field in each block was 
performed by a routine that required only the block identification code and the string of input 
data. This routine had access to a block definition file that had the number of fields to be 
processed for each block and the field type (alphabetic or numeric), the field width in the data 
record, and the valid range of values for each field. The routine processed each field in 
sequential order, performing the necessary translation, validation, and reporting tasks. 
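
A routine of this shape, driven by a block definition and given only a block identification code and an input string, might look like the following sketch. The definition contents, the block code "MB", the `critical` flag, and the log message wording are all invented for illustration, and the input string is assumed to be as long as the block's fields require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDef:
    kind: str                # "numeric" or "alphabetic"
    width: int               # field width in the data record
    valid: frozenset         # valid range of values
    critical: bool = False   # hypothetical flag: blanks are logged if True


# Hypothetical block definition file with a single three-field block.
BLOCK_DEFS = {
    "MB": [
        FieldDef("numeric", 1, frozenset("1234")),
        FieldDef("numeric", 1, frozenset("12345")),
        FieldDef("numeric", 1, frozenset("012349"), critical=True),
    ],
}


def validate_block(block_id, data):
    """Walk the fields of one block in order; return (output string, edit log)."""
    out, edit_log, pos = [], [], 0
    for number, field in enumerate(BLOCK_DEFS[block_id], start=1):
        raw = data[pos:pos + field.width]
        pos += field.width
        if raw.strip() == "":
            if field.critical:     # a blank elsewhere is a nonresponse, no action
                edit_log.append((block_id, number, "blank critical field"))
        elif "*" in raw:
            edit_log.append((block_id, number, "multiple marks"))
        elif (field.kind == "numeric") != raw.isdigit():
            edit_log.append((block_id, number, "wrong field type"))
        elif raw not in field.valid:
            edit_log.append((block_id, number, "out of range"))
        out.append(raw)            # value is moved to the output as read
    return "".join(out), edit_log


print(validate_block("MB", "2*7"))
```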

The first of these tasks checked for the presence of blanks or asterisks in a critical field. 
These were recorded on the edit log and processing continued with the next field. No action 
was taken on blank-filled fields for multiple-choice items since that code indicated a 
nonresponse. The field was validated for range of response, recording anything outside of that 
range to the edit log. The item type code was used by the program to make a further 
distinction among constructed-response item scores and other numeric data fields. Moving the 
translated and edited data field into the output buffer was the last task performed in this phase 
of processing. 

The completed string of data was written to the data file when the entire document had 
been processed. Then, when the next session header record was encountered, the program 
repeated the same set of processes for that session. The program closed the data set and 
generated an edit listing when it encountered the end of a file. 

Accuracy checks were performed on each batch processed. Every 500th document of 
each booklet form was printed in its entirety, with a minimum of one document type per batch. 
This record was checked, item by item, with the source document for errors. 



5.10 EDITING 

Quality procedures and software throughout the system ensure that the NAEP data are 
correct. The initial editing that took place during the receipt control process included 
verification of the schools and sessions. Receipt control personnel checked that all student 
documents on the Administration Schedule were undamaged and assembled correctly. The 
machine edits performed during data capture verified that each sheet of each document was 
present and that each field had an appropriate value. All batches entered into the system were 
edited for errors. 

Data editing occurred after these checks and consisted of a computerized edit review of 
each respondent’s document and the clerical edits necessary to make corrections based upon the 
computer edit. This data editing step was repeated until all data were correct. 



The first phase of data editing was designed to ensure that all documents were present. 
A computerized edit list was produced after the NAEP documents were scanned. The edit 
function was performed using this list together with the supporting documentation sent from 
the field. The hard copy edit list contained all the vital statistics about the batch and each 
school and session within the 
batch, such as the number of students, school code, type of document, assessment code, error 
rates, suspect cases, and record serial numbers. Using these inputs, the data editor verified that 
the batch had been assembled correctly, each school number was correct, and all student 
documents within each session were present. 

During data entry, counts of documents processed by type were generated. These counts 
were checked against the Administration Schedule counts entered into the Process Control 
System during the receiving process. The number of assessed and absent students processed 
had to match the number of used booklets indicated on the Process Control System. 
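
The count check described above amounts to comparing, session by session, the number of assessed and absent booklets processed with the number of used booklets recorded at receipt. The sketch below assumes a simple list of (session, status) pairs and a dictionary of expected counts; both structures and all names are hypothetical.

```python
from collections import Counter


def booklet_count_mismatches(processed, pcs_used_booklets):
    """Return {session: (processed count, expected count)} where the two differ."""
    counted = Counter(
        session for session, status in processed if status in ("assessed", "absent")
    )
    mismatches = {}
    for session in sorted(set(counted) | set(pcs_used_booklets)):
        found = counted.get(session, 0)
        expected = pcs_used_booklets.get(session, 0)
        if found != expected:
            mismatches[session] = (found, expected)
    return mismatches


processed = [("S1", "assessed"), ("S1", "absent"), ("S1", "assessed"), ("S2", "assessed")]
pcs_counts = {"S1": 3, "S2": 2}
print(booklet_count_mismatches(processed, pcs_counts))
```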

The second phase of data editing was carried out by an experienced editing staff using a 
predetermined set of rules to review the field errors and record corrections to be made to the 
student data file. The same computerized edit list used in the first phase was also used to 
perform this function. 

The editing staff made corrections using the edit log prepared by the computer and the 
actual source document listed on the edit log. The corrections were identified by batch 
sequence numbers and field name for suspect record and field identification. The edit log 
indicated the current composition of the field. This particular piece of information was then 
visually checked against the NAEP source document by the editing staff for double grids, 
erasures, smudge marks or omitted items that were flagged. Each flagged item was handled in 
one of the following ways: 

• Correctable Error: If the error could be corrected by the editing staff, according 
to the editing specifications, the corrections were indicated on the edit listing. 

• Field Correctable: If an error was not correctable according to the specifications, 
an alert was issued to the operations coordinator for resolution. Once the 
correct information was obtained, the correction was indicated on the edit listing. 

• Noncorrectable Error: If an error suspect was found to be correct as stated, and 
no alteration was possible according to source documents and specifications, the 
programs were tailored to allow this information to be accepted into the data 
record and no corrective action was taken. 

These corrections were noted on the edit list. When the entire batch of sessions was 
resolved, the list was forwarded to the key entry staff. The corrections were entered and 
verified through the Falcon system. When all corrections were entered and verified for a batch, 
an extract program was run to pull the correction records to a mainframe data set. 

The post-edit program was initiated next. This program applied the corrections to the 
specified records and once again applied the error criteria to all records. If there were further 
errors, another edit list was printed and the cycle began again. 

When the edit process had produced an error-free file, the booklet ID number was 
posted to the NAEP tracking file by school and sessions. This allowed for an accumulation 
process to accurately measure the number of documents processed for a session within a school 
and the number of documents processed by form. The posting of booklet IDs also ensured that 
a booklet ID was not processed more than once. These data allowed the progress of the 
assessment to be monitored and reported on the status report. 

At this point, a job was automatically submitted to produce the NAEP scoring sheets for 
this batch. The program also selected the records to be scored by a second reader for reliability. 
These sheets were printed, matched with the original documents, and forwarded to the NAEP 
scoring area. 

Once all documents for a batch had been scored, the sheets were batched and submitted 
to scanning. A series of edits were run to verify the information on these sheets. The scorer 
identification fields were processed at this point and certain checks were made. The routine 
validated the score range and did not permit a blank field. If no score was indicated or the 
score was out of range, the disparity was noted on the edit log. 
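
A minimal sketch of the score check described above is given here. The function name `check_scores`, the item identifiers, and the representation of a scoring sheet as a dict are assumptions made for illustration; the actual edit routine is not documented in this report.

```python
def check_scores(sheet, valid_ranges):
    """Return edit-log entries for blank or out-of-range scores on one scoring sheet.

    sheet:        {item_id: integer score, or None / "" if the field is blank}
    valid_ranges: {item_id: (lowest_valid_score, highest_valid_score)}
    """
    edit_log = []
    for item_id, (low, high) in valid_ranges.items():
        score = sheet.get(item_id)
        if score is None or score == "":
            # A blank score field is not permitted.
            edit_log.append((item_id, "blank"))
        elif not (low <= score <= high):
            # The score falls outside the range defined for this item.
            edit_log.append((item_id, "out of range"))
    return edit_log
```

Sheets that produce an empty log pass the edit; any other sheet would appear on the error log returned to the scoring group.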

These error logs were returned to the scoring groups for resolution and the corrections 
were entered directly to the files. The edit process was repeated until the file was error free. 

As a final quality control check, ETS identified a random sample of each booklet type 
from the master student file. The designated documents and scoring sheets were located, 
removed from storage and forwarded to ETS for quality control (see Chapter 6). On 
completion of quality control processing, the booklets were returned to NCS for return to 
storage. 



5.11 QUESTIONNAIRES 

The questionnaires were received either with the session shipment or in a later 
shipment. Once the questionnaires were verified with the roster, they were accumulated by the 
receiving clerks. The school characteristics and policies questionnaires, teacher questionnaires 
and excluded student questionnaires were batched and sent to scanning at regular intervals. 
Every effort was made to keep current on all forms, both to ensure the processing of all 
documents for a session and to deliver all data at the same time. 

All documents, regardless of method of entry, were run through the process of error 
identification and resolution. 



5.12 MERGING OF STUDENT DATA 

At the completion of the scoring and verification of the constructed responses, the 
complete records for students were merged. This merge included the machine-scanned data, the 
scores to the constructed responses, and the responses from the estimation booklets. 

Verification of complete student records was conducted prior to the delivery of the data files. 



5.13 STORAGE OF DOCUMENTS 

Once the editing process had been successfully completed on the batches, they were sent 
to the NCS warehouse for storage. The storage location of all documents was recorded on the 
inventory control system and stored for later retrieval. Unused materials were sent to 
temporary storage until the completion of the assessment and acceptance of the data files, at 
which time they were destroyed. 



Chapter 6 



CREATION OF THE DATABASE 

AND EVALUATION OF THE QUALITY CONTROL OF DATA ENTRY 



John J. Ferris and David S. Freund 
Educational Testing Service 



6.1 OVERVIEW 

The data transcription and editing procedures described in Chapter 5 resulted in the 
generation of disk and tape files containing various data for assessed students, excluded 
students, teachers, and schools. The weighting procedures described in Chapter 7 resulted in 
the generation of data files that included the sampling weights required to make valid statistical 
inferences about the population from which the 1992 fourth- and eighth-grade Trial State 
Mathematics Assessment samples were drawn. These files were merged into a comprehensive, 
integrated database. To evaluate the effectiveness of the quality control of the data entry 
process, the final integrated database was sampled, and the data were verified in detail against 
the original instruments received from the field. 

This chapter begins with a description of the transcribed data files and the procedure of 
merging them, or bringing them together, to create the 1992 Trial State Assessment database for 
fourth- and eighth-grade students. The last section presents the results of the quality control 
evaluation. 



6.2 MERGING FILES INTO THE TRIAL STATE ASSESSMENT DATABASE 

The transcription process conducted by National Computer Systems resulted in the 
transmittal to ETS of four data files for both fourth and eighth grade: one file for each of the 
three questionnaires (teacher, school, and excluded student) and one for the student response 
data. The sampling weights, derived by Westat, Inc., comprised an additional three files for 
each grade— one for students, one for schools, and one for excluded students. (See Chapter 7 
for a discussion of the sampling weights.) These seven files at each grade were the foundation 
for the analysis of the 1992 Trial State Assessment data. Before data analyses could be 
performed, these data files had to be integrated into a coherent and comprehensive database. 

The 1992 Trial State Assessment database for fourth and eighth grade consisted of three 
files — student, school, and excluded student. Each record on the student file contained a 
student’s responses to the particular assessment booklet the student was administered (booklets 
1 to 26), the student’s responses to booklet 29 (a single block of estimation items that was 
administered to all assessed students), and the information from the questionnaire that the 
student’s mathematics teacher completed. (See Chapter 2 for information regarding assessment 
instruments.) Since teacher response data can be reported only at the student level, it was not 
necessary to have separate teacher files. The school files and excluded student files were 
separate and could be linked to the student files through the state and school codes. 

The creation of the student data files for fourth and eighth grade began with the 
reorganization of the data files received from National Computer Systems. This involved two 
major tasks: 1) the files were restructured, eliminating unused (blank) areas to reduce the size 
of the files; and 2) in cases where students had chosen not to respond to an item, the missing 
responses were recoded as either "omit" or "not reached," as appropriate. Next, the student 
response data were merged with the student weights file. The resulting file was then merged 
with the teacher response data. In both merging steps, the booklet ID (the two-digit booklet 
number and a five-digit serial number) was used as the matching criterion. 
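
The recoding and merging steps above can be sketched as follows. The report does not state the rule that separates "omit" from "not reached"; the sketch assumes the common convention that a run of missing responses at the end of a block is "not reached" and any other missing response is "omit." The record layout and function names are hypothetical.

```python
OMIT = "omit"
NOT_REACHED = "not reached"


def recode_missing(responses):
    """Recode missing (None) responses within one block of items.

    Assumed rule: missing responses after the last answered item are
    'not reached'; missing responses before it are 'omit'.
    """
    last_answered = -1
    for position, response in enumerate(responses):
        if response is not None:
            last_answered = position
    recoded = []
    for position, response in enumerate(responses):
        if response is not None:
            recoded.append(response)
        elif position < last_answered:
            recoded.append(OMIT)
        else:
            recoded.append(NOT_REACHED)
    return recoded


def merge_on_booklet_id(primary, secondary):
    """Attach fields from `secondary` to each record in `primary`.

    Records are dicts keyed by 'booklet_id', here a seven-character string:
    the two-digit booklet number followed by the five-digit serial number.
    Primary records without a match are kept unchanged.
    """
    lookup = {record["booklet_id"]: record for record in secondary}
    merged = []
    for record in primary:
        extra = lookup.get(record["booklet_id"], {})
        merged.append({**extra, **record})
    return merged
```

Applying `merge_on_booklet_id` twice, first with the weights file and then with the teacher file, mirrors the two merging steps described in the text.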

The school file for each grade was created by merging the school questionnaire file with 
the school weights file and a file of school variables, supplied by Westat, which included 
demographic information about the schools collected from the principal’s questionnaire. The 
state and school codes were used as the matching criteria. Since some schools did not return a 
questionnaire and/or were missing principal’s questionnaire data, some of the records in the 
school file contained only school-identifying information and sampling weight information. 

The excluded student file for each grade was created by merging the excluded student 
questionnaire file with the excluded student weights file. The assessment booklet serial number 
was used as the matching criterion. 

When the student, school, and excluded student files for each grade had been created, 
the database was ready for analysis. In addition, whenever new data values, such as composite 
background variables or plausible values, were derived, they were added to the appropriate 
database files using the same matching procedures as described above. 

For archiving purposes, restricted-use data files and codebooks for each state were 
generated from this database. The restricted-use data files contain all responses and response- 
related data from the assessment, including responses from the student booklets and teacher and 
school questionnaires, proficiency scores, sampling weights, and variables used to compute 
standard errors. 



6.3 CREATING THE MASTER CATALOG 

A critical part of any database is its processing control and descriptive information. 
Having a central repository of this information, which may be accessed by all analysis and 
reporting programs, will provide correct parameters for processing the data fields and consistent 
labeling for identifying the results of the analyses. The Trial State Assessment master catalog 
file was designed and constructed to serve these purposes for the Trial State Assessment 
database. 

Each record of the master catalog contains the processing, labeling, classification, and 
location information for a data field in the Trial State Assessment database. The control 
parameters are used by the access routines in the analysis programs to define the manner in 
which the data values are to be transformed and processed. 

Each data field has a 50-character label in the master catalog describing the contents of 
the field and, where applicable, the source of the field. The data fields with discrete or 
categorical values (e.g., multiple-choice items and professionally scored items, but not weight 
fields) have additional label fields in the catalog containing 8- and 20-character labels for those 
values. 



The classification area of the master catalog record contains distinct fields corresponding 
to predefined classification categories (e.g., mathematics content and process areas) for the data 
fields. For a particular classification field, a nonblank value indicates the code of the 
subcategory within the classification categories for the data field. This classification area 
permits the grouping of identically classified items or data fields by performing a selection 
process on one or more classification fields in the master catalog. 
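
The selection process described above might look like the following sketch. The record structure, the field names, and the classification codes are all hypothetical; the actual layout of the master catalog is not given in this report.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """One master catalog entry describing a single data field (hypothetical layout)."""
    field_name: str
    label: str                                            # descriptive label, up to 50 characters
    value_labels: dict = field(default_factory=dict)      # value -> short label, categorical fields only
    classification: dict = field(default_factory=dict)    # category -> subcategory code; absent means blank


def select_fields(catalog, **criteria):
    """Return names of data fields whose classification codes match every criterion."""
    return [
        record.field_name
        for record in catalog
        if all(record.classification.get(category) == code
               for category, code in criteria.items())
    ]
```

For example, `select_fields(catalog, content="2", process="1")` would group all items classified identically on both hypothetical categories.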

The master catalog file was constructed concurrently with the collection and transcription 
of the Trial State Assessment data so that it would be ready for use by analysis programs when 
the database was created. As new data fields were derived and added to the database, their 
corresponding descriptive and control information were entered into the master catalog. 



6.4 QUALITY CONTROL EVALUATION 

The purpose of the data entry quality control procedure is to gauge the overall accuracy 
of the process that transforms responses into machine-readable data. The procedure involves 
examining the actual responses made in a random sample of booklets and comparing them with 
the responses recorded in the final database, which is used for analysis and reporting. 



6.4.1 Student Data 

Twenty-six assessment booklets numbered 1 through 26 and an estimation block 
identified as booklet 29 were administered as part of the Trial State Assessment in mathematics. 
Table 6-1 provides the numbers of each booklet at each grade for which data were scanned into 
data files. These numbers varied somewhat more than in the 1990 assessment, but chi-square 
measures of the variation proved to be statistically nonsignificant at both grades. 

Since booklet 29 was administered to all students, it was treated for quality control 
purposes as an extension of each of booklets 1 through 26. All data for a selected student were 
collected and examined, including data from booklet 29. 

The number of students assessed in each of the 44 participating jurisdictions varied also. 
At grade 4, 29 jurisdictions met or exceeded the target of 2,500 students and a few smaller 
jurisdictions fell several hundred short of the target. The average number of fourth-grade 
students assessed in each jurisdiction was 2,523. At grade 8, 26 jurisdictions met or exceeded 

Table 6-1 




Number of Assessment Booklets Scanned and Selected for Quality Control Evaluation 



Booklet 

Number 



10 



11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 



Total 



Total Booklets Scanned 



Grade 4 



4,211 

4.186 
4,162 
4,169 
4,151 
4,227 
4,211 
4,192 
4,279 
4,323 
4377 
4332 

4388 
4363 
4338 
4375 
4371 
4331 
43 34 
4329 

4389 
4334 
4351 
4321 

4.187 
4,161 

110,992 



Grade 8 



4,129 

4,142 

4,140 

4,167 

4,165 

4306 

4.185 
4,198 
4,177 
4334 
4358 
4321 
4303 
4,163 
4,151 

4.155 
4,179 

4.186 
4310 
4,192 
4325 
4,174 

4.156 
4,125 
4,13S 
4,110 

108386 



Total Booklets Selected 



Grade 4 



11 



Grade 8 



13 



12 



13 

11 



10 



11 



12 

12 

10 



10 



u 



10 

10 

12 

11 

13 



11 

11 

10 

10 



iL 

10 



271 



10 



13 

13 



11 



JL 

10 



12 

iL 

15 



13 



10 

11 



10 



13 



271 

the target of 2,500 students, the same jurisdictions as at grade 4 fell short, and the average number 
of students assessed was 2,468. To simplify the selection of booklets for examination, a method 
was developed that involved selecting all occurrences of a specified booklet in a randomly 
selected "stack." A stack is a unit of collection containing anywhere from 11 to 105 booklets, but 
typically between 50 and 60 booklets, in an assortment related to the spiraling technique used to 
distribute the booklets. The selection method was designed to yield approximately the same 
number of each booklet but, due to the variability in the size and contents of the stacks, there 
was somewhat more variation in the numbers of booklets selected than in the 1990 assessment 
(see Table 6-1). However, all of the booklets were sampled in adequate numbers and the 
average rate of selection was 1/410 at grade 4 and 1/401 at grade 8, a selection rate comparable 
to that used in past assessments at both the state and national levels. The few errors found 
during this quality control examination did not cluster by booklet number, so there is no reason 
to believe that the variation in numbers of booklets selected had a significant effect on the 
estimates of overall error rate confidence limits reported below. 

The quality control evaluation detected only 10 errors total for both grades in these 
booklet samples — six instances of multiple responses that were not identified as such by the 
scanner, and four instances of erasures that were recorded instead of ignored. The usual quality 
control analysis based on the binomial theorem permits the inferences described in Table 6-2. 
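
The report does not spell out the binomial calculation. One plausible reading, offered here only as an assumption, is a one-sided exact upper confidence limit: the largest error rate p for which observing this few errors still has probability of at least 1 minus the confidence level. The function names below are illustrative.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p)
            + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return total


def upper_error_rate_limit(errors, characters, confidence=0.998):
    """One-sided exact upper limit on the error rate, found by bisection.

    Solves binomial_cdf(errors, characters, p) = 1 - confidence for p.
    The CDF decreases as p increases, so bisection on p is valid.
    """
    alpha = 1.0 - confidence
    low, high = 1e-12, 1.0 - 1e-12
    for _ in range(200):
        mid = (low + high) / 2.0
        if binomial_cdf(errors, characters, mid) > alpha:
            low = mid   # p is still too small: this few errors is too likely
        else:
            high = mid
    return high
```

Under this assumption, a call such as `upper_error_rate_limit(9, 8136)` (nine errors in 8,136 characters) gives a limit in the neighborhood of .0026, which is the order of magnitude reported for the questionnaire data in this chapter.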



Table 6-2 

Inference from the Quality Control Evaluation of Student Data 

                      Different   Number of                Number               99.8% 
             Entry    Booklets    Booklets    Characters   of        Observed   Confidence 
Subsample    Type     Sampled     Sampled     Sampled      Errors    Rate       Limit 

Grade 4      Scanned     26          271        33,682        1                   .0003 
Grade 8      Scanned     26          271        38,888        9 





The grade 8 error rate is about the same as was observed in the 1990 assessment. For 
some reason, the students in grade 4 did not seem to challenge the scanner with erasures and 
optical ambiguities, so the error rate confidence limit for that grade was much lower. Neither 
error rate offers the threat of interference with the validity of any data analyses. As usual, there 
was some indication that the error rates could be improved with further tuning of the scanner 
procedures, but the process as it stands can certainly be described as clean and reliable. A very 
large volume of data was scanned with consistently excellent results. 



6.4.2 Teacher Questionnaires 

A total of 14,553 questionnaires at grade 4 and 11,453 questionnaires at grade 8 were 
collected from mathematics teachers. Questionnaires were sampled at the rate of 1 in 200, 
resulting in the selection of 72 questionnaires at grade 4 and 58 questionnaires at grade 8. The 
selected questionnaires contained a total of 13 errors, usually involving the scanner’s mistaking 
an erasure for a response, but occasionally involving the failure of the scanner to pick up a 
multiple response. In every case, the respondent’s intention was clear to the human eye, but the 
scanner seemed unprepared to exercise the same judgment that a careful observer would. The 
result is an error rate for the teacher questionnaire data that is 5 to 10 times as high as for the 
student data. One possible explanation for this is that teacher questionnaires are inherently 
more complex than student assessment booklets, which leads to a much higher rate of erasures 
and other errors by the respondents. Perhaps a redesign of these questionnaires would bring 
the error rate down. This is not to say that the degree of erroneous data in the teacher 
questionnaire files is worrisome, but rather that the student data are much more error-free. 
There is every indication that the quality of the teacher data is more than adequate for the 
purposes to which it was put. 



6.4.3 School Questionnaires 

A total of 4,857 questionnaires at grade 4 and 3,699 at grade 8 were collected from 
school administrators. These questionnaires were sampled for quality control evaluation at the 
rate of 1 in 50, resulting in the selection of 97 questionnaires at grade 4 and 74 at grade 8. In 
the 1990 assessment, data from the school questionnaires had been key-entered; for the 1992 
assessment, the completed questionnaires were machine-scanned. It is interesting to compare 
these two very different data entry methods. While the overall error rates were the same, none 
of the errors in the keyed data involved the misreading of erasures or multiple responses; in the 
scanned data, all of the errors were of this type. 

Again, the quality of the data was very good, with an error rate of about half that of the 
teacher questionnaire data. 



6.4.4 Excluded Student Questionnaires 

A total of 13,268 excluded student questionnaires were scanned at grade 4, and 6,454 at 
grade 8. These were sampled at the rate of about 1 in 200, resulting in the selection of 66 
questionnaires at grade 4 and 33 questionnaires at grade 8. All the errors found were due to 
the scanner’s mistaking an erasure for an intended response. 

The quality of these data appears to be about as high as that of the other questionnaires; that 
is to say, adequate for the purposes to which they were put. The results of the evaluation of the 
questionnaire data are summarized in Table 6-3. 







Table 6-3 

Inference from the Quality Control Evaluation of Questionnaire Data 

                       Different   Number of                Number               99.8% 
              Entry    Booklets    Booklets    Characters   of        Observed   Confidence 
Subsample*    Type     Sampled     Sampled     Sampled      Errors    Rate       Limit 

Grade 4 
  TQ          Scanned      1          72         8,136         9       .0011      .0026 
  SQ          Scanned      1          97         9,312         4       .0004      .0015 
  XQ          Scanned      1          66         5,148         4       .0008      .0027 

Grade 8 
  TQ          Scanned      1          58         6,264         4       .0006      .0022 
  SQ          Scanned      1          74         7,104         2       .0003      .0012 
  XQ          Scanned      1          33         2,574         3       .0012      .0047 

* TQ = Teacher questionnaire; SQ = School questionnaire; XQ = Excluded student questionnaire 





Chapter 7 

WEIGHTING PROCEDURES AND VARIANCE ESTIMATION 



Adam Chu and Keith F. Rust 
Westat, Inc. 



7.1 INTRODUCTION 

Following the collection of assessment and background data from and about assessed and 
excluded students, sampling weights and associated sets of replicate weights were derived. The 
sampling weights are needed to make valid inferences from the student samples to the 
respective populations from which they were drawn. Replicate weights are used in the 
estimation of sampling variance, through the procedure known as jackknife repeated replication. 

Each student was assigned a weight to be used for making inferences about the state’s 
students. This weight is known as the full-sample or overall sample weight. In the 1990 Trial 
State Assessment Program, a second weight, known as the comparison weight, was also derived 
for the purpose of comparing the assessment performance of students in monitored sessions 
with those in unmonitored sessions. However, for the 1992 Trial State Assessment Program, 
comparison weights were not calculated. Valid (i.e., unbiased) comparisons of this kind can be 
made using the full sample weights; however, the standard errors associated with these 
comparisons are somewhat larger than those that would be obtained using comparison weights. 

The full-sample weight contains three components. First, a base weight is established 
that is the inverse of the overall probability of selection of the sampled student. The base 
weight incorporates the probability of selecting a school and the student within a school, and 
accounts for the impact of procedures used to keep to a minimum the overlap of the state 
school sample with the NAEP national sample and the sample for the National Longitudinal 
Study of Chapter 1 Children (see Chapter 3). The base weight is then adjusted for two sources 
of nonparticipation: school-level and student-level. These weighting adjustments seek to 
reduce the potential for bias from such nonparticipation by increasing the weights of students 
from schools similar to those schools not participating, and increasing the weights of students 
similar to those students from within participating schools who did not attend the assessment 
session (or a makeup session) as scheduled. The details of how these weighting steps were 
implemented are given in sections 7.2 and 7.3. 

In addition to the full-sample estimation weights, a set of replicate weights was provided 
for each student. These replicate weights are used in calculating the sampling errors of 
estimates obtained from the data, using the jackknife repeated replication method. Full details 
of the method of using these replicate weights to estimate sampling errors are contained in the 
technical reports for the 1988 and 1990 national assessments (Johnson & Zwick, 1990; Johnson 
& Allen, 1992). Section 7.5 of this report describes how the sets of replicate weights were 
generated for the 1992 Trial State Assessment data. The methods of deriving these weights 
were aimed at reflecting the features of the sample design appropriately in each state, so that 
when the jackknife variance estimation procedure is implemented, approximately unbiased 
estimates of sampling variance result. 
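
This section does not give the variance formula itself; the cited technical reports are the authoritative source. As a hedged illustration only, the sketch below assumes the paired-replicate form of the jackknife, in which the variance estimate is the sum over replicates of the squared difference between each replicate estimate and the full-sample estimate. The function names and the choice of a weighted mean as the statistic are assumptions.

```python
def weighted_mean(values, weights):
    """Weighted mean of `values` using `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_variance(values, full_weights, replicate_weights):
    """Jackknife repeated replication estimate of the variance of a weighted mean.

    replicate_weights is a list of weight vectors, one per replicate.
    Assumed formula: sum over replicates of (replicate estimate - full estimate)^2.
    """
    full_estimate = weighted_mean(values, full_weights)
    return sum(
        (weighted_mean(values, weights) - full_estimate) ** 2
        for weights in replicate_weights
    )
```

The same pattern applies to any statistic: recompute it once per set of replicate weights and accumulate the squared deviations from the full-sample value.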



7.2 CALCULATION OF BASE WEIGHTS 

The base weight assigned to a school was the reciprocal of the probability of selection of 
that school. For the fourth-grade samples, the school base weight depended on the subject of 
assessment since some schools were so small that students were tested in only one subject in 
those schools. Under the sample selection procedures used for the 1992 Trial State Assessment 
Program (see Chapter 3), the school selection probability may be greater than 1 for large 
schools. In this case, the probability of selection actually represents the expected number of 
times the school would be selected under the systematic sampling process. In general, the 
school base weight reflected the actual probability used to select the school from the frame, 
including the impact of avoiding schools selected for the NAEP national sample and the sample 
for the National Longitudinal Study of Chapter 1 Children (see Chapter 3). 

The student base weight was obtained by multiplying the school base weight by the 
within-school student weight, where the within-school student weight reflected the probability of 
selecting students within the school for a particular assessment subject. Additional details about 
the weighting process are given in the sections below. 



7.2.1 Calculation of School/hit Base Weights 

As described in section 3.4.5, schools were sometimes selected in clusters in order to 
avoid giving small schools an extremely low probability of selection. Moreover, large clusters (or 
schools) could have been selected more than once in the systematic sampling process. If a large 
cluster (or school) was selected more than once, each selection or "hit" was treated separately 
in the selection of students within a school. For example, a school that was selected twice was 
allocated twice the usual numbers of students for the assessments; a school that was selected 
three times was allocated three times the usual numbers of students for the assessments. 

The weight for sample cluster c was computed as: 

$$W_c^{\text{cl}} = \frac{E}{m \, E_c}$$ 

where 

$E_c$ = the enrollment in the given grade for the cth cluster in the state; 

$E = \sum_{c} E_c$ = the state-wide enrollment in the given grade; and 

$m$ = the number of cluster/hits selected from the state. 



If a cluster was selected more than once, each hit received the same base weight. In 
general, the base weight for sample school i (or school/hit i if the school was selected more than 
once) in a given state was computed as: 

$$W_i^{\text{sch}} = W_c^{\text{cl}} \, T_i$$ 

where $W_c^{\text{cl}}$ is the base weight of the cluster containing school i and $T_i$ is a "thinning" factor 
that reflects the fact that small schools in the Cluster Type 2 states were subject to thinning (see 
section 3.5.3). The thinning factor $T_i$ was equal to the ratio of the sampling size measure of the 
largest school in the cluster to the size measure of the retained school. 

Since all schools in Cluster Type 1 states were included in the sample with certainty (see 
section 3.5.2), they were assigned school base weights ($W_i^{\text{sch}}$) equal to 1. 
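
The cluster and school base weight calculations in this section can be sketched as follows. The function and parameter names are illustrative, and the numbers in the usage note are invented for the example rather than taken from the assessment.

```python
def cluster_base_weight(cluster_enrollment, state_enrollment, n_hits):
    """Reciprocal of the selection probability under sampling proportional to enrollment.

    The selection probability (or expected number of hits) of a cluster is
    n_hits * cluster_enrollment / state_enrollment.
    """
    return state_enrollment / (n_hits * cluster_enrollment)


def school_base_weight(cluster_weight, largest_size=None, retained_size=None):
    """Cluster weight times the thinning factor.

    The thinning factor is the size measure of the largest school in the
    cluster divided by the size measure of the retained school; it is 1
    when no size measures are supplied (no thinning applied).
    """
    if largest_size is None or retained_size is None:
        return cluster_weight
    return cluster_weight * (largest_size / retained_size)
```

With a state-wide enrollment of 80,000, 100 cluster/hits, and a cluster enrollment of 200, the cluster weight is 80,000 / (100 × 200) = 4. A retained school with size measure 20 in a cluster whose largest school has size measure 60 would then receive a school base weight of 4 × 3 = 12.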



7.2.2 Weighting New Schools 

As described in Chapter 3, new schools were sampled from the updated sampling frame 
list from each district in a sample of districts. In a few states, the selection probabilities of some 
new schools were quite small, resulting in excessively large school base weights. Where the 
weighted contribution to the estimate of total enrollment of a new school exceeded three times 
the median contribution, the base weight for that school was adjusted downwards (trimmed) in 
order to reduce the impact of the extreme weights on the variance of the estimates. Base 
weights were trimmed for a total of nine new schools in the following six states: New Jersey 
(grades 4 and 8), North Carolina (grade 4), Indiana (grade 8), Kentucky (grade 8), New York 
(grade 8), and Ohio (grade 4). For these nine schools, the trimmed school weight (which was 
then used in the subsequent calculation of nonresponse adjustments) was computed as the 
maximum allowable weighted contribution to the estimated total grade enrollment for the given 
state, divided by the estimated grade enrollment of the new school. The maximum allowable 
contribution was established so that the weighted contribution of the new school to the total 
weighted grade enrollment never exceeded about three times the median value of the 
distribution of weighted enrollment counts for the remaining schools in the sample. 

This adjustment was made to avoid introducing substantial variability into the sample 
estimates, as a result of giving relatively very large weights to one or two schools, and thus the 
sampled students within them. Although this procedure technically introduces a bias in the 
estimates for these states, we judged that it would be trivial in comparison to the level of 
sampling variance. For a discussion of issues involved in trimming of survey weights, see Potter 
(1988) and Stokes (1990). 
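
The trimming rule for new schools can be illustrated with the sketch below. It assumes the cap is exactly three times the median weighted enrollment of the remaining sample schools, whereas the text says "about three times," so the factor is left as a parameter. The function name and arguments are hypothetical.

```python
from statistics import median


def trim_new_school_weight(weight, enrollment, other_contributions, factor=3.0):
    """Trim a new school's base weight if its weighted enrollment is extreme.

    weight:              untrimmed school base weight
    enrollment:          estimated grade enrollment of the new school
    other_contributions: weighted enrollment counts (weight * enrollment)
                         for the remaining schools in the state sample
    factor:              multiple of the median used as the cap (assumed 3.0)
    """
    cap = factor * median(other_contributions)
    if weight * enrollment <= cap:
        return weight              # contribution is acceptable; no trimming
    return cap / enrollment        # scale the weight so the contribution equals the cap
```

For a new school with weight 50 and enrollment 100, and remaining contributions with a median of 1,000, the cap is 3,000 and the weight is trimmed from 50 to 30.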



7.2.3 Treatment of Substitute and Double-session Substitute Schools 

Schools that replaced a refusing school (i.e., substitute schools) were assigned the weight 
of the refusing school, unless the substitute school also refused. Schools conducting extra 
sessions that served as substitutes for a refusing school (i.e., double-session substitutes) in effect 
had two school weights. The students in the school who were assigned to the original session 
were given the school base weight of the participating school, while those students assigned to 
the extra session(s) were assigned the school base weight of the refusing school. 



7.2.4 Calculation of Student Base Weights 

Within the sampled schools, eligible students were assigned to sessions using the 
procedures described in sections 3.5.7 and 3.6. The within-school probability of selection for 
assessment in mathematics therefore depended on the number of grade-eligible students in the 
school and the number of students selected for the assessment (usually 30). The within-school 
weights for the substitute schools were further adjusted to compensate for differences in the 
sizes of the substitute and the originally sampled (replaced) schools. In the case of the fourth- 
grade sample, the within-school weight also reflected the fact that a small school could have 
been selected for one subject but not the other. Thus, in general, the within-school student 
weight for the jxh student in school i was equal to: 

n U ~ 

n i 



where 



N, - the number of grade-eligible students enrolled in the school as reported in 

the sampling worksheets; and 

n t = the number of students selected for the given subject. 

The factors K v and Ky in the formula for the within-school student weight generally 
apply to only a few schools in each state. The factor K u adjusts the count of grade-eligible 
students in a substitute school to be consistent with corresponding count of the originally 
sampled (replaced) school. Specifically, for substitute schools, 
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£, » the QED grade enrollment of the originally sampled (replaced) school; 

and 

Ef ~ the QED grade enrollment of the substitute school. 

For nonsubstitute schools, K u - 1. 

The factor $K_{2i}$ applies only to the fourth-grade sample and reflects the subsampling
procedure used to select the subject in which students in small schools were to be assessed
(section 3.5.7). For a given subject, $K_{2i}$ is defined as follows:

\[ K_{2i} = \begin{cases} 1 & \text{if the fourth-grade school was selected for both subjects;} \\ 2 & \text{if the fourth-grade school was selected only for the given subject;} \\ 0 & \text{if the fourth-grade school was not selected for the given subject.} \end{cases} \]

Note that if $K_{2i}$ is 2 for mathematics (say), then $K_{2i}$ is 0 for reading, and vice versa.

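The overall student base weight for a student j selected for mathematics assessment in
school i was then computed as:

\[ W_{ij}^{base} = W_i^{sch} \, W_{ij}^{within} \]

where $W_i^{sch}$ is the school base weight defined in section 7.2.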

Checks were made on these student base weights to ensure that the value was always 1.0 or 
greater. 
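
The weight construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example figures are hypothetical, and the formulas follow the definitions given in the text (the count ratio, the substitute-school factor, and the small-school subject factor).

```python
def substitute_factor(orig_enrollment, sub_enrollment):
    """K1: ratio of the replaced school's QED grade enrollment to the substitute's."""
    return orig_enrollment / sub_enrollment


def small_school_factor(selected_for_given, selected_for_other):
    """K2: 1 if selected for both subjects, 2 if only the given subject, 0 otherwise."""
    if not selected_for_given:
        return 0.0
    return 1.0 if selected_for_other else 2.0


def within_school_weight(n_eligible, n_selected, k1=1.0, k2=1.0):
    """Within-school weight: (eligible / selected) * K1 * K2."""
    if n_selected <= 0:
        raise ValueError("n_selected must be positive")
    return (n_eligible / n_selected) * k1 * k2


def student_base_weight(school_base_weight, within_weight):
    """Overall student base weight: school base weight times within-school weight."""
    return school_base_weight * within_weight


# Hypothetical example: 120 eligible students, 30 selected, school base weight 12.5.
w_within = within_school_weight(120, 30)
w_base = student_base_weight(12.5, w_within)
```

In this hypothetical case the within-school weight is 4.0 and the student base weight is 50.0, which satisfies the check that base weights are never below 1.0.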



7.3 Adjustments for Nonresponse 

The base weight for a student was adjusted by two factors: one to adjust for 
nonparticipating schools for which no substitute participated, and one to adjust for students who 
were invited to the assessment but did not appear in either the scheduled or makeup sessions. 



7.3.1 Defining Initial School-level Nonresponse Adjustment Classes 

School-level nonresponse adjustment classes were initially created based on the 
urbanicity and minority strata used in sampling. In states and urbanicity strata where minority 
stratification was not used, nonresponse classes were created based on median household 
income. 







The procedure for creating income classes was as follows. First, three classes of schools 
were formed for each urbanicity stratum so that (1) each class had approximately the same 
number of sample schools and (2) the classes were ranked from low to high income. This was 
done using only the schools in the sample (including new schools), sorting them by median 
income, and then dividing the schools into three groups with equal numbers of schools. In a few 
states (Cluster Type 3 states) only large schools (those with grade enrollment over 20) were 
used to form the income strata, although all schools were classified into either income or 
minority strata. In creating the nonresponse adjustment classes, urbanicity was used as the 
primary variable and minority/income was used as the secondary variable. 
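
The income-class construction can be illustrated with a short Python sketch. It assumes only what the text states (sort the sampled schools by median income and cut them into three groups of roughly equal size); the function name and the sample incomes are hypothetical.

```python
def income_classes(median_incomes, n_classes=3):
    """Assign each school a class 1..n_classes (low to high income),
    with approximately equal numbers of schools per class."""
    n = len(median_incomes)
    # Indices of schools ordered from lowest to highest median income.
    order = sorted(range(n), key=lambda i: median_incomes[i])
    classes = [0] * n
    for rank, idx in enumerate(order):
        classes[idx] = rank * n_classes // n + 1
    return classes


# Hypothetical median incomes for six sampled schools in one urbanicity stratum.
incomes = [30000, 18000, 25000, 41000, 22000, 36000]
classes = income_classes(incomes)
```

When the number of schools is not divisible by three, this rule lets the class sizes differ by at most one school, which is one reasonable reading of "approximately the same number of sample schools."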

The initial nonresponse adjustment classes (i.e., sampling strata) are summarized for 
each state in Tables 3-3 and 3-4 of Chapter 3. As can be seen in these tables, the definition of 
the initial nonresponse adjustment classes varied from one state to another. For example, nine 
classes obtained by cross-classifying three levels of urbanicity (central city, suburban, other) with 
three levels of minority status (low, medium, and high) were defined for Alabama, whereas for 
New York, the classes were defined by minority status within the central city and suburban 
strata, and by income classes within the rural stratum. 



7.3.2 Constructing the Final Nonresponse Adjustment Classes 

The objective in forming the final nonresponse adjustment classes was to create as many 
classes as possible that were internally as homogeneous as possible, but such that the resulting 
nonresponse adjustment factors were not subject to large random variation. The procedures 
discussed below were established with the aim of meeting this objective. 

The schools (or school/hits in the case of schools that were selected more than once in 
the sampling process) were sorted into the initial nonresponse classes defined in Tables 3-3 and 
3-4 and the following unweighted and weighted counts and ratios were produced for each class: 

• total in-scope school/hits from the original sample (an in-scope school is one that 
has at least one eligible student enrolled); 

• participating in-scope schools from the sample (both original and substitutes); 
and 

• total in-scope schools from the original sample divided by participating in-scope 
schools from the sample. 

The weights used in the calculations were the school/hit base weights defined in section 7.2, 
multiplied by the QED grade enrollment for the school. 

The following guidelines were established for reviewing these counts and ratios and 
determining what collapsing should be done. Within an initial nonresponse class, if the weighted 
ratio of in-scope schools to participating schools was less than 1.35, with at least six participating 
schools in the class, there was no need to collapse the particular cell. If any nonresponse class 
had fewer than six schools or a ratio greater than or equal to 1.35, it was collapsed with another 
class such that the new class met these conditions. The order of variables to be collapsed (from 
most desirable to least desirable) was income strata or minority strata, followed by urbanicity 
strata. The exceptions occurred in cases where minority classes within an urbanicity stratum 
varied considerably as to the relative sizes of the minority population. In such cases, we 
collapsed over urbanicity first to keep the classes as homogeneous as possible with regard to 
race/ethnicity. In some cases, final classes were formed with ratios in excess of 1.35. This 
occurred in states with relatively high school nonresponse. In no case was a class formed with 
fewer than six school/hits. 

The choices of 1.35 as a cutoff for the nonresponse adjustment and 6 as the minimum 
number of participants within a class were both motivated by the desire to balance two 
conflicting needs. These are described in the first paragraph of this section. These limits were 
chosen on the basis of practical experience, combined with the application of theory about the 
effects of nonresponse class size on the accuracy of survey estimates, in a manner appropriate 
for the levels of nonresponse encountered in the various states. 
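
The review rule for a single school-level class can be sketched as follows. This is a hedged illustration, not the production procedure: the function name and inputs are hypothetical, and the weighted totals are assumed to have been computed beforehand as school/hit base weight times QED grade enrollment.

```python
def school_class_needs_collapsing(weighted_in_scope, weighted_participating,
                                  n_participating, max_ratio=1.35, min_schools=6):
    """Return True if the class should be collapsed with another class.

    A class is kept as-is only if it has at least `min_schools` participating
    schools and its weighted in-scope / participating ratio is below `max_ratio`.
    """
    if n_participating < min_schools:
        return True
    ratio = weighted_in_scope / weighted_participating
    return ratio >= max_ratio


# Hypothetical classes: (weighted in-scope, weighted participating, participants)
ok_class = school_class_needs_collapsing(1000.0, 800.0, 8)       # ratio 1.25
high_ratio = school_class_needs_collapsing(1000.0, 700.0, 8)     # ratio about 1.43
too_small = school_class_needs_collapsing(1000.0, 900.0, 5)      # only 5 schools
```

The sketch flags a class for collapsing but does not choose the partner class; that choice follows the ordering of variables described in the text.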



7.3.3 School/hit Adjustment Factors 

The school-level nonresponse adjustment factor for the ith school/hit in the hth class was 
computed as: 

\[ F_h^{(1)} = \frac{\sum_{i \in C_h} W_{hi}^{sch} \, E_{hi}}{\sum_{i \in C_h} W_{hi}^{sch} \, E_{hi} \, \delta_{hi}} \]

where 

$C_h$ = the subset of school/hit records in class h; 

$W_{hi}^{sch}$ = the base weight of the ith school/hit in class h; 

$E_{hi}$ = the QED grade enrollment for the ith school/hit in class h; and 

$\delta_{hi}$ = 1 if the ith school/hit in adjustment class h participated in the 
assessments, and 0 otherwise. 



In the calculation of the above nonresponse adjustment factors, a school was said to have 
participated if 



• it was selected for the sample from the QED frame or from the lists of new 
schools provided by participating school districts, and student assessment data 
were obtained from the school; 

• the school refused but was replaced by a regular substitute school and student 
assessment data were obtained from the substitute school (so that the substitute 
participated in place of the originally selected school); or 

• the school refused but was replaced by a double-session substitute school and the 
double-session substitute provided student assessment data for both the original 
and substitute sessions (so that the substitute school conducted additional 
sessions to replace the originally selected school). 



Both the numerator and denominator of the nonresponse adjustment factor contained only in- 
scope schools. 

The nonresponse-adjusted weight for the ith school/hit in class h was computed as: 

\[ W_{hi}^{adj} = F_h^{(1)} \, W_{hi}^{sch} \]
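
As a minimal sketch of the school-level adjustment, the following Python computes the factor for one class and applies it to the participating school/hits. The record layout (base weight, QED grade enrollment, participation flag) and the numbers are hypothetical; all records are assumed to be in-scope.

```python
def school_nonresponse_factor(records):
    """records: iterable of (base_weight, enrollment, participated) for one class.

    Factor = sum of weight * enrollment over all in-scope school/hits,
    divided by the same sum over participating school/hits only.
    """
    total = sum(w * e for w, e, _ in records)
    responding = sum(w * e for w, e, p in records if p)
    if responding == 0:
        raise ValueError("class has no participating school/hits; collapse it first")
    return total / responding


def adjusted_school_weights(records):
    """Nonresponse-adjusted weights for the participating school/hits."""
    factor = school_nonresponse_factor(records)
    return [factor * w for w, _, p in records if p]


# Hypothetical class with three school/hits, one of which did not participate.
records = [(10.0, 100, True), (10.0, 50, True), (20.0, 25, False)]
factor = school_nonresponse_factor(records)   # 2000 / 1500
adjusted = adjusted_school_weights(records)
```

After adjustment, the weighted enrollment of the participating school/hits equals the weighted enrollment of the full in-scope class, which is the purpose of the factor.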



7.3.4 Student-level Nonresponse Adjustment Classes 

The variables used to define initial classes for adjusting for student nonresponse were: 

• the final school-level nonresponse adjustment classes described in section 7.3.2; 

• the age class of the student; and 

• the monitor status of the session the student attended. 



Two age classes, “old” and “young,” were defined for both grades. For grade 8, the 
“old” students were those born in September 1977 or earlier, while the “young” students were 
those born after September 1977. For grade 4, “old” students were those born in September 
1981 or earlier; “young” students were those born after September 1981. Students in the “old” 
class are to some extent outliers with regard to age among their cohort. Previous findings from 
NAEP have shown that students in the “old” group tend to have higher absentee rates and 
lower proficiency scores than do students in the “young” group. 
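
The age-class rule can be written compactly. The sketch below is illustrative only; the function name and the representation of birth dates as (year, month) pairs are assumptions, while the cutoffs come from the text.

```python
# Latest (year, month) of birth still classified as "old", by grade.
OLD_CUTOFF = {8: (1977, 9), 4: (1981, 9)}


def age_class(grade, birth_year, birth_month):
    """Return "old" for students born in the cutoff month or earlier, else "young"."""
    cutoff = OLD_CUTOFF[grade]
    return "old" if (birth_year, birth_month) <= cutoff else "young"
```

Tuple comparison orders by year first and then by month, so a grade 8 student born in September 1977 is "old" and one born in October 1977 is "young".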

In order to determine whether the initial nonresponse classes needed collapsing, we 
reviewed the unweighted and weighted counts of assessed and absent students in each initial 
cell. (Excluded students were processed separately, using essentially the same procedures 
developed for assessed students.) The weight used for each student was the student base weight, 
adjusted for school nonresponse (using $W_{hi}^{adj}$ from section 7.3.3). The following guidelines were 
established for collapsing the initial nonresponse cells when necessary. Any cell with fewer than 
20 assessed students was collapsed regardless of the value of the adjustment factor. If a cell had 
between 20 and 30 assessed students and the ratio of the weighted count of invited students to 
the weighted count of assessed students was greater than 1.5, the cell was collapsed. If a cell 
had more than 30 assessed students and the ratio of the weighted count of invited students to 
the weighted count of assessed students was greater than 2.0, the cell was collapsed. 
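
These three guidelines amount to a simple decision rule, sketched below. The function name is hypothetical, and the treatment of the boundary (cells with exactly 20 or exactly 30 assessed students fall in the middle band) is an assumption about how "between 20 and 30" should be read.

```python
def student_cell_needs_collapsing(n_assessed, weighted_invited, weighted_assessed):
    """Apply the student-level collapsing guidelines to one initial cell."""
    if n_assessed < 20:
        # Too few assessed students, regardless of the adjustment factor.
        return True
    ratio = weighted_invited / weighted_assessed
    if n_assessed <= 30:
        return ratio > 1.5
    return ratio > 2.0
```

For example, a cell with 25 assessed students and a weighted ratio of 1.6 would be collapsed, whereas a cell with 40 assessed students and the same ratio would not.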

When necessary, the collapsing of the initial cells proceeded as follows: First, collapsing 
was done across monitor status within all other classes. If the resulting cell still needed to be 
collapsed, the collapsing across monitor status was undone, and new cells were formed by 
collapsing across minority/income class. If these new cells still needed to be collapsed, 
collapsing across monitor status was done, followed by collapsing by urbanicity class and finally 
by age group, if necessary. Based on these guidelines, some collapsing was done for all states, 
usually over monitor status and particularly for “old" students. 



7.3.5 Student Nonresponse Adjustments 

As described above, the student-level nonresponse adjustments for the assessed students 
were made within classes defined by the final school-level nonresponse adjustment cells, monitor 
status of the school, and age group of the students. Let the kth final (collapsed) nonresponse 
class be denoted as $A_k$. The adjusted student base weight for the jth sample student in 
school/hit i in class $A_k$ was calculated as: 

\[ W_{ij}^{adj} = W_{hi}^{adj} \, W_{ij}^{within} = F_h^{(1)} \, W_{ij}^{base} \]

where 

$W_{hi}^{adj}$ = the nonresponse-adjusted school/hit weight for school/hit i in school 
adjustment class h; 

$W_{ij}^{within}$ = the within-school weight for the jth student in school i; and 

$W_{ij}^{base}$ = the student base weight for student j in school i. 



Using the adjusted student base weights, the assessed student nonresponse adjustment 
was calculated within nonresponse adjustment class $A_k$ as: 

\[ F_k^{(2)} = \frac{\sum_{j \in A_k} W_{ij}^{adj}}{\sum_{j \in A_k} W_{ij}^{adj} \, \delta_{jk}} \]

where 

$\delta_{jk}$ = 1 if the jth student in adjustment class k participated in the 
assessments, and 0 otherwise. 



For excluded students, the same basic procedures as described above for assessed 
students were used, except that the numerator and denominator contained excluded rather than 
assessed students, and monitor status and student age group were not used to form the 
adjustment classes. 

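The final student weight for the jth student in class k was then computed as the product of 
the student nonresponse adjustment for class k and the adjusted student base weight: 

\[ W_{ij}^{final} = F_k^{(2)} \, W_{ij}^{adj} \]
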
Tables 7-1 and 7-2 summarize the final unweighted and weighted counts of assessed and 
excluded students, by state and grade. Checks were made on the final student weight 
distributions and totals at the state and subgroup within state, to ensure that there were no 
unexpected weight outliers or unusual distributions. 
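
The student-level adjustment and the final weight can be sketched in the same style. The data layout (adjusted base weight, assessed flag) and the numbers are hypothetical; the calculation assumes that the final weight is the class factor times the adjusted student base weight, as described above.

```python
def student_nonresponse_factor(students):
    """students: iterable of (adjusted_base_weight, assessed) for one class.

    Factor = weighted count of invited students / weighted count of assessed students.
    """
    invited = sum(w for w, _ in students)
    assessed = sum(w for w, a in students if a)
    if assessed == 0:
        raise ValueError("class has no assessed students; collapse it first")
    return invited / assessed


def final_student_weights(students):
    """Final weights for the assessed students in one class."""
    factor = student_nonresponse_factor(students)
    return [factor * w for w, a in students if a]


# Hypothetical class: four invited students, one absent.
students = [(50.0, True), (50.0, True), (40.0, False), (60.0, True)]
factor = student_nonresponse_factor(students)   # 200 / 160
final = final_student_weights(students)
```

The final weights of the assessed students sum to the weighted count of all invited students in the class, so the absent student's weight is redistributed among classmates who were assessed.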



7.4 Characteristics of Nonresponding Schools and Students 

In the previous section procedures were described for adjusting the survey weights so as 
to reduce the potential bias of nonparticipation of sampled schools and students. To the extent 
that a nonresponding school or student is different from those respondents in the same 
nonresponse adjustment class, potential for nonresponse bias remains. 

In this section, we examine the potential for remaining nonresponse bias in two related 
ways. First we examine the weighted distributions, within each grade and state, of certain 
characteristics of schools and students, both for the full sample and for respondents only. This 
analysis is of necessity limited to those characteristics that are known for both respondents and 
nonrespondents, and hence cannot directly address the question of nonresponse bias. The 
approach taken does reflect the reduction in bias obtained through the use of nonresponse 
weighting adjustments. As such, it is more appropriate than a simple comparison of the 
characteristics of respondents with those of nonrespondents for each state. 

The second approach is to present some summary characteristics of nonrespondents and 
respondents from nonresponse adjustment classes where relatively large adjustment factors were 
obtained. In such classes the number of nonrespondents is relatively large, particularly in 
relation to the number of respondents available, and hence it is in these cases that the greatest 
potential for nonresponse bias exists. For those states and classes not appearing in these tables, 
it can be assumed that the potential for nonresponse bias is likely to be much less than in the 
cases shown. 

Table 7-1 

Unweighted and Weighted Counts of Assessed Students by State and Grade 

State | Grade 8 Mathematics, Unweighted | Grade 8 Mathematics, Weighted | Grade 4 Mathematics, Unweighted | Grade 4 Mathematics, Weighted 
Alabama | 24*22 | 51402 | 2,605 | 52411 
Arizona | 2,617 | 41,443 | 2,741 | 48,978 
Arkansas | 2,556 | 30469 | 2,621 | 31,479 
California | 2416 | 314475 | 2,412 | 333,787 
Colorado | 2,799 | 40,122 | 2,906 | 45,706 
Connecticut | 2,613 | 29,459 | 2,600 | 31492 
Delaware | 1,934 | 7,000 | 2,040 | 65,966 
District of Columbia | 1,516 | 4 465 | 2499 | 5483 
Florida | 2449 | 116405 | 2,828 | 140,142 
Georgia | 2489 | 77,420 | 2,766 | 92,994 
Guam | 1,496 | 1,667 | 1,933 | 2,164 
Hawaii | 2,454 | 11416 | 2,625 | 12,687 
Idaho | 2,615 | 16,646 | 2,784 | 16,925 
Indiana | 2,659 | 75,493 | 2493 | 72437 
Iowa | 2,816 | 35,438 | 2,770 | 35,450 
Kentucky | 2,756 | 45415 | 2,703 | 43,920 
Louisiana | 2482 | 48,803 | 2,792 | 56,033 
Maine | 2,464 | 15418 | 1,898 | 10,406 
Maryland | 2499 | 47,980 | 2,844 | 55466 
Massachusetts | 2,456 | 52,806 | 2449 | 58,076 
Michigan | 2,616 | 106426 | 2,412 | 110456 
Minnesota | 2,471 | 49,746 | 2,640 | 55424 
Mississippi | 2,498 | 34409 | 2,712 | 37,655 
Missouri | 2,666 | 56403 | 2409 | 53,922 
Nebraska | 235 | 19,703 | 2427 | 16438 
New Hampshire | 2435 | 12,129 | 2465 | 14411 
New Jersey | 2,174 | 80,894 | 2431 | 74479 
New Mexico | 2461 | 20,543 | 2442 | 22,021 
New York | 2,158 | 164,133 | 2484 | 180433 
North Carolina | 2,769 | 80,460 | 2,884 | 76,824 
North Dakota | 2414 | 8,418 | 2,193 | 8,079 
Ohio | 2435 | 136422 | 2,637 | 133,945 
Oklahoma | 2,141 | 38,711 | 2454 | 42498 
Pennsylvania | 2,612 | 113,724 | 2,740 | 125,005 
Rhode Island | 2,120 | 9,621 | 2490 | 10474 
South Carolina | 2,625 | 44422 | 2,771 | 48,038 
Tennessee | 2,485 | 57,901 | 2,708 | 57465 
Texas | 2,614 | 221,818 | 2,623 | 244,988 
Utah | 2,726 | 31,181 | 2,799 | 34436 
Virgin Islands | 1,479 | 1,601 | 905 | 1,863 
Virginia | 2,710 | 69,751 | 2,786 | 76,029 
West Virginia | 2,690 | 23,681 | 2,786 | 23,030 
Wisconsin | 2,814 | 57427 | 2,780 | 58,196 
Wyoming | 2,444 | 7,038 | 2,605 | 7,438 
TOTAL | 10830 | 2408404 | 110,992 | 2,724,651 

Table 7-2 

Unweighted and Weighted Counts of Excluded Students with Returned Questionnaires 
by State and Grade 

State | Grade 8 Mathematics, Unweighted | Grade 8 Mathematics, Weighted | Grade 4 Mathematics, Unweighted | Grade 4 Mathematics, Weighted 
Alabama | 155 | [illegible] | 122 | 2,468 
Arizona | 173 | [illegible] | 148 | 2342 
Arkansas | 176 | 2,013 | 154 | 1307 
California | 237 | 28321 | 327 | 46376 
Colorado | 134 | 1,809 | 158 | 2326 
Connecticut | 186 | 2,077 | 173 | 2340 
Delaware | 95 | 303 | 120 | 419 
District of Columbia | 207 | 454 | 240 | 517 
Florida | 193 | 8,020 | 265 | 13384 
Georgia | 135 | 3,799 | 150 | 5,130 
Guam | 56 | 72 | 133 | 142 
Hawaii | 138 | 578 | 166 | 814 
Idaho | 89 | 545 | 100 | 605 
Indiana | 135 | 3,623 | 91 | 2,483 
Iowa | 129 | 1,498 | 96 | 1,196 
Kentucky | 134 | 2,176 | 99 | 1,631 
Louisiana | 120 | 2,178 | 118 | 2,458 
Maine | 116 | 710 | 116 | 919 
Maryland | 119 | 2358 | 117 | 2337 
Massachusetts | 213 | 4,679 | 214 | 4364 
Michigan | 183 | 6,864 | 130 | 6,132 
Minnesota | 88 | 1,785 | 93 | 1,940 
Mississippi | 203 | 2,643 | 142 | 1,919 
Missouri | 126 | 2,630 | 107 | 2,658 
Nebraska | 108 | 837 | 117 | 854 
New Hampshire | 152 | 686 | 98 | 548 
New Jersey | 168 | 5,900 | 132 | 4382 
New Mexico | 154 | 1,158 | 165 | 1,747 
New York | 193 | 15334 | 127 | 10,073 
North Carolina | 102 | 2,709 | 121 | 3367 
North Dakota | 63 | 207 | 44 | 165 
Ohio | 174 | 8,833 | 156 | 8,606 
Oklahoma | 184 | 2,605 | 199 | 3325 
Pennsylvania | 127 | 5,153 | 112 | 5,184 
Rhode Island | 119 | 517 | 151 | 627 
South Carolina | 110 | 2,771 | 142 | 2,429 
Tennessee | 136 | 2,978 | 114 | 2,603 
Texas | 205 | 15,633 | 232 | 20,093 
Utah | 131 | 1,425 | 125 | 1,452 
Virgin Islands | 77 | 86 | 24 | 48 
Virginia | 153 | 3,890 | 156 | 4303 
West Virginia | 177 | 1,454 | 134 | 1,113 
Wisconsin | 127 | 2,610 | 135 | 3315 
Wyoming | 107 | 289 | 98 | 272 
TOTAL | 6367 | 159,572 | 6,161 | 181313 

7.4.1 Weighted Distributions of Schools Before and After School Nonresponse 

Tables 7-3 and 7-4 show the mean values of certain school characteristics, both before 
and after nonresponse. The means are weighted appropriately to reflect whether nonresponse 
adjustments have been applied (i.e., to respondents only) or not (to the full set of in-scope 
schools). The variables for which means are presented are the percentage of students in the 
school who are Black, the percentage who are Hispanic, the median income of the ZIP code 
area where the school is located, and the “type of locale.” All variables were obtained from the 
sample frame, described in Chapter 3. The type of locale variable has seven possible levels, 
which are defined in section 3.4.2. Although this variable is not interval-scaled, the mean value 
does give an indication of the degree of urbanization of the population represented by the 
school sample (lower values for type of locale indicate a greater degree of urbanization). 

Two sets of means are presented for these four variables. The first set shows the 
weighted mean derived from the full sample of in-scope schools; that is, respondents and 
nonrespondents (for which there was no participating substitute). The weight for each sampled 
school is the product of the school base weight and the grade enrollment. This weight therefore 
represents the number of students in the state represented by the selected school. The second 
set of means is derived from responding schools only, after school substitution. In this case the 
weight for each school is the product of the nonresponse-adjusted school weight and the grade 
enrollment, and therefore indicates the number of students in the state represented by the 
responding school. 

The differences between these sets of means give an indication of the potential for 
nonresponse bias that has been introduced by nonresponding schools for which there was no 
participating substitute. For example, in Arkansas at grade 4 the mean percentage Black 
enrollment, estimated from the original sample, is 24.92 percent. The estimate from the 
responding schools is 24.47 percent. Thus there may be a slight bias in the results for Arkansas 
because these two means differ. Note, however, that throughout these two tables the differences 
in the two sets of mean values are very slight, suggesting that it is unlikely that substantial bias 
has been introduced by schools that did not participate and for which no substitute participated. 
Of course in a number of states (as indicated) there was no nonresponse at the school level, so 
that these sets of means are identical. Even in those states where school nonresponse was 
relatively high (such as Maine, New Jersey, and New York), the differences in means are slight. 
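
The comparison of the two sets of means can be illustrated with a small Python sketch. It assumes the weighting described above (school weight times grade enrollment) and, for simplicity, treats all schools as belonging to a single nonresponse adjustment class; the function names and the data are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of a school characteristic."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def compare_means(values, base_weights, enrollments, responded):
    """Return (full-sample mean, responding-sample mean after adjustment)."""
    full_w = [b * e for b, e in zip(base_weights, enrollments)]
    full_mean = weighted_mean(values, full_w)

    # Single-class nonresponse factor: all in-scope / responding, enrollment-weighted.
    factor = sum(full_w) / sum(w for w, r in zip(full_w, responded) if r)
    resp_vals = [v for v, r in zip(values, responded) if r]
    resp_w = [factor * w for w, r in zip(full_w, responded) if r]
    return full_mean, weighted_mean(resp_vals, resp_w)


# Hypothetical percent-Black enrollment for three schools; the third did not respond.
full_mean, resp_mean = compare_means(
    [20.0, 30.0, 40.0], [10.0, 10.0, 10.0], [100, 100, 100], [True, True, False]
)
```

In this contrived example the means differ by five percentage points because the nonresponding school is unlike the respondents in its class; the differences reported in Tables 7-3 and 7-4 are far smaller.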



7.4.2 Characteristics of Nonresponding Schools 

Tables 7-5 and 7-6 show the distributions of some characteristics of nonresponding and 
responding schools, by school nonresponse adjustment class, for classes with adjustment factors 
in excess of 1.25. Table 7-5 shows results for grade 4, Table 7-6 for grade 8. The respondents 
include the case where substitute schools participated. In other words, the nonrespondents 
include only those nonrespondents for which no substitute participated. 

Table 7-3 

Weighted Mean Values Derived from Sampled Schools, Grade 4 

Full Sample = Weighted Mean Values Derived from Full Sample. 
Responding Sample = Weighted Mean Values Derived from Responding Sample, with Substitutes and School Nonresponse Adjustment. 

State | Weighted Participation Rate After Substitution | Full Sample: Percent Black | Full Sample: Percent Hispanic | Full Sample: Median Income | Full Sample: Type of Locale | Responding Sample: Percent Black | Responding Sample: Percent Hispanic | Responding Sample: Median Income | Responding Sample: Type of Locale 
Alabama | 97% | 31.72% | [illegible] | $22359 | 4.67 | 31.46% | 0.04% | $22,455 | 4.67 
Arizona | 100% | 4.06% | 2136% | $29,763 | 3.17 | [illegible] | 2136% | $29,783 | 3.17 
Arkansas | 99% | 24.92% | [illegible] | $21375 | 538 | 24.47% | 039% | $21,405 | 538 
California | 97% | 832% | 3531% | $32,636 | 3.17 | 835% | 3539% | $32,682 | 3.16 
Colorado | 100% | 430% | 1738% | $31,517 | 3.67 | 430% | 1738% | $31317 | 3.67 
Connecticut | 99% | 9.82% | 8.46% | $39332 | 3.66 | 9.82% | 8.46% | $39360 | 3.65 
Delaware | 92% | 2435% | [illegible] | $25343 | 4.48 | 2330% | 032% | $25390 | 4.48 
Dist. of Columbia | 99% | 9031% | 3.47% | $27,916 | 1.00 | 91.07% | 3.49% | $27,901 | 1.00 
Florida | 100% | 24.17% | 1038% | $27380 | 339 | 24.17% | 1038% | $27380 | 359 
Georgia | 100% | 33.93% | 134% | $28306 | 4.41 | 3333% | 134% | $28306 | 4.41 
Guam | 94% | 221% | [illegible] | - | 7.00 | 237% | 031% | - | 7.00 
Hawaii | 100% | 1.41% | [illegible] | $33,990 | 4.00 | 1.41% | 0.00% | $33,990 | 4.00 
Idaho | 97% | 0.12% | 438% | $25320 | 5.42 | 0.12% | 438% | $25358 | 5.42 
Indiana | 91% | 1136% | [illegible] | $28,441 | 435 | 1037% | 038% | $28395 | 435 
Iowa | 100% | 0.96% | [illegible] | $26328 | 4.92 | 036% | 035% | $26328 | 432 
Kentucky | 96% | 737% | [illegible] | [illegible] | 538 | 7.40% | 0.06% | $223&1 | 539 
Louisiana | 100% | 45.18% | [illegible] | [illegible] | 438 | 45.18% | 032% | $22,414 | 438 
Maine | 71% | 0.17% | [illegible] | [illegible] | 5.73 | 030% | 032% | $26,750 | 5.72 
Maryland | 99% | 2739% | 136% | [illegible] | 3.45 | 27.77% | 138% | $39,949 | 3.45 
Massachusetts | 97% | 6.77% | 4.12% | $37,119 | 3.71 | 6.79% | 4.12% | $37,123 | 3.70 
Michigan | 90% | 14.79% | 036% | $31,784 | 411 | 14.60% | 1.12% | $31,913 | 4.10 
Minnesota | 94% | 2.03% | 035% | $32,178 | 4.73 | 137% | 034% | $32,426 | 4.73 
Mississippi | 100% | 4734% | 0.17% | $19,437 | 538 | 4734% | 0.17% | $19,437 | 538 
Missouri | 97% | 15.40% | 0.66% | $27331 | 4.47 | 1536% | 0.64% | $27,080 | 431 
Nebraska | 87% | 3.94% | 033% | $27,906 | 4.79 | 333% | 1.04% | $27,981 | 430 
New Hampshire | 80% | 0.73% | 036% | $35,647 | 531 | 0.65% | 039% | $35,716 | 532 
New Jersey | 82% | 15.73% | 837% | $40336 | [illegible] | 14.84% | 8.65% | $40,099 | 339 
New Mexico | 90% | 160% | 44.47% | $22,600 | 4.63 | 2.74% | 4537% | $22355 | 4.64 
New York | 83% | 15.75% | [illegible] | [illegible] | 3.17 | 14.70% | 16.17% | $32314 | 3.18 
North Carolina | 99% | 2734% | [illegible] | $26,066 | 4.94 | 2731% | 0.01% | $26,131 | 4.94 
North Dakota | 90% | 0.46% | 0.06% | $26,916 | 5.07 | 038% | 0.07% | $26,624 | 5.11 
Ohio | 91% | 1034% | 035% | $28,765 | 4.12 | 935% | 032% | $28,965 | 4.15 
Oklahoma | 98% | 7.46% | 132% | $25,421 | 4.49 | 632% | 133% | $25,444 | 430 
Pennsylvania | 95% | 1232% | 330% | $28341 | 437 | 1236% | 332% | $28347 | 436 
Rhode Island | 96% | 430% | 3 35% | .**0,169 | 338 | 330% | 4.00% | $30,057 | 3 36 
South Carolina | 99% | 3731% | 0.07% | $^6303 | 5.00 | 37.42% | 0.07% | $26359 | 5.00 
Tennessee | 93% | 2038% | 0.06% | $24,437 | 4.12 | 2136% | 0.06% | $24326 | 4.12 
Texas | 98% | 14.45% | 34.15% | $26340 | 3.44 | 1430% | 3434% | $26372 | 3.44 
Utah | 98% | 0.13% | 0.84% | $31,139 | 435 | 0.13% | 034% | $31,149 | 435 
Virginia | 99% | 2435% | 138% | $36,618 | 4.19 | 2432% | 133% | $36,445 | 4.19 
West Virginia | 100% | 230% | [illegible] | $21348 | 5.63 | 230% | 0.16% | $21348 | 5.63 
Wisconsin | 100% | 630% | 134% | [illegible] | 437 | 630% | 134% | $31350 | 437 
Wyoming | 97% | 0.63% | 637% | [illegible] | 5.40 | 0.64% | 6.68% | $30348 | 5.40 

Table 7-4 

Weighted Mean Values Derived from Sampled Schools, Grade 8 

Full Sample = Weighted Mean Values Derived from Full Sample. 
Responding Sample = Weighted Mean Values Derived from Responding Sample, with Substitutes and School Nonresponse Adjustment. 

State | Weighted Participation Rate After Substitution | Full Sample: Percent Black | Full Sample: Percent Hispanic | Full Sample: Median Income | Full Sample: Type of Locale | Responding Sample: Percent Black | Responding Sample: Percent Hispanic | Responding Sample: Median Income | Responding Sample: Type of Locale 
Alabama | 92% | 31 23% | 0.03% | $21,746 | 492 | 3198% | 0.02% | $21,636 | [illegible] 
Arizona | 99% | 3.74% | 22.74% | $29,074 | 330 | 3.75% | 2259% | $28,987 | 330 
Arkansas | 97% | 24.13% | 033% | $21,692 | 5.44 | 23.47% | 033% | $21,632 | 5.44 
California | 98% | 8.89% | 3150% | $34,481 | 3.19 | 8.65% | 3132% | $34,641 | 3.19 
Colorado | 100% | 4.64% | 1651% | $31,468 | 3.71 | 4.64% | 1651% | $31,468 | 3.71 
Connecticut | 99% | 8.94% | 658% | $40,763 | 3.92 | 8.94% | 6.60% | $40,787 | 391 
Delaware | 100% | 20.91% | 0.61% | $32,463 | 5.19 | 2091% | 0.61% | $32463 | 5.19 
Dist. of Columbia | 100% | 92.49% | 4.08% | $28,443 | 1.00 | 9249% | 4.08% | $28,443 | 1.00 
Florida | 100% | 2335% | 1158% | $28,009 | 355 | 2335% | 1158% | $28,009 | 355 
Georgia | 99% | 3451% | 0.94% | $28,048 | 4.46 | 345% | 0.90% | $27,602 | 4.47 
Guam | 100% | 150% | 0.00% | - | 7.00 | 150% | 0.00% | - | 7.00 
Hawaii | 100% | 093% | 0.08% | $33,823 | 3.91 | 093% | 0.08% | $33,825 | 391 
Idaho | 91% | 0.13% | 4.75% | $25398 | 531 | 0.13% | 4.49% | $25,577 | 551 
Indiana | 94% | 9.16% | 0.90% | $28,982 | 4.62 | 894% | 0.42% | $29,046 | 4.62 
Iowa | 99% | 120% | 037% | $25,962 | 5.15 | 120% | 027% | $25,997 | 5.15 
Kentucky | 98% | 7.09% | 0.12% | $22,603 | 5.13 | 7.18% | 0.13% | $22,602 | 5.12 
Louisiana | 100% | 44.78% | 139% | $22,819 | 431 | 44.78% | 139% | $22819 | 431 
Maine | 84% | 0.17% | 053% | $27,012 | 5.75 | 0.09% | 050% | $27,092 | 525 
Maryland | 91% | 2957% | 1.14% | $40,625 | 333 | 29.13% | 1.13% | $40213 | 353 
Massachusetts | 95% | 3.63% | 5.07% | $37351 | 3.90 | 350% | 535% | $37202 | 391 
Michigan | 94% | 16.78% | 0.94% | $32,144 | 4.18 | 16.61% | 055% | $32232 | 4.18 
Minnesota | 92% | 159% | 0.45% | [illegible] | 4.74 | 0.76% | 0.43% | $32022 | 4.75 
Mississippi | 100% | 44.03% | 0.19% | [illegible] | 5.49 | 44.03% | 0.19% | $19,000 | 5.49 
Missouri | 99% | 11.49% | 0.90% | [illegible] | 4.77 | 1155% | 090% | $26,944 | 4.77 
Nebraska | 85% | 2.97% | 0.66% | $27331 | 522 | 325% | 0.68% | $26,933 | 5.13 
New Hampshire | 92% | 0.76% | 036% | $35,647 | 551 | 0.65% | 0.69% | $35,716 | 522 
New Jersey | 78% | 16.65% | 737% | $40552 | 3.67 | 1555% | 853% | $40220 | 3.67 
New Mexico | 94% | 2.03% | 4394% | $22,788 | 4.65 | 213% | 4430% | $22967 | 4.65 
New York | 83% | 17.06% | 1230% | $33596 | 325 | 16.17% | 1122% | $34276 | 327 
North Carolina | 98% | 26.78% | 0.07% | $26,064 | 5.00 | 2626% | 0.07% | $26,059 | 5.00 
North Dakota | 97% | 0.19% | 0.03% | $26506 | 528 | 0.19% | 0.03% | $26206 | 528 
Ohio | 90% | 1130% | 031% | $28,658 | 430 | 1158% | 026% | $28,447 | 430 
Oklahoma | 68% | 6.45% | 0.72% | $24,635 | 451 | 651% | 0.73% | $24,622 | 431 
Pennsylvania | 94% | 10.81% | 134% | $29382 | 4.46 | 1132% | 1.43% | $29,109 | 4.43 
Rhode Island | 100% | 3.74% | 339% | $30,846 | 3.60 | 3.74% | 329% | $30,845 | 3.60 
South Carolina | 97% | 3633% | 034% | $26,141 | 499 | 36.14% | 025% | [illegible] | 499 
Tennessee | 91% | 18.93% | 0.09% | $23,731 | 439 | 1927% | 0.10% | [illegible] | 438 
Texas | 99% | 12.92% | 31.62% | $26,923 | 352 | 1276% | 3159% | $26,872 | 352 
Utah | 100% | 0.10% | 098% | $30,655 | 422 | 0.10% | 098% | [illegible] | 422 
Virginia | 97% | 2250% | 1.66% | $36,906 | 425 | 2253% | 1.69% | $37,021 | 4.25 
Virgin Islands | 100% | 8796% | 10.01% | - | - | 8796% | 10.01% | - | - 
West Virginia | 100% | 3.88% | 0.12% | $21,623 | 5.60 | 3.88% | 0.12% | $21,623 | 5.60 
Wisconsin | 100% | 531% | 0.98% | $31,446 | 4.76 | 531% | 098% | $31,446 | 4.76 
Wyoming | 99% | 0.63% | 6.69% | $30,977 | 5.47 | 0.63% | 6.74% | [illegible] | 5.47 

Table 7-5 

Grade 4 School Nonresponse Adjustment Classes with Adjustment Factors Greater than 1.25 

1 In some states, larger schools were selected into the sample more than once. Thus, the number of school selections may exceed somewhat the actual number of schools involved. 
2 Median household income of ZIP code area where school is located, derived from 1980 population census data and expressed in 1985 dollars. 

Table 7-6 

Grade 8 School Nonresponse Adjustment Classes with Adjustment Factors Greater than 1.25 

1 In some states, larger schools were selected into the sample more than once. Thus, the number of school selections may exceed somewhat the actual number of schools involved. 
2 Median household income of ZIP code area where school is located, derived from 1980 population census data and expressed in 1985 dollars. 





The characteristics shown are as follows: 



• The set of distinct values for the “type of locale” variable. This variable, which was 
used for sample stratification, has seven possible levels, which are defined in 
Chapter 3, section 3.4.2. 

• The percentage of the state's public-school grade enrollment represented in the sample 
by the schools within the adjustment class. The school nonresponse adjustment 
factor is calculated directly from these two quantities (one for respondents, one 
for nonrespondents). The potential for nonresponse bias is generally greater in 
cases where the size of the set of nonrespondents is relatively large. 

• The minimum, median, and maximum percentage enrollments of Black and Hispanic 
students. In cases where there are only two nonresponding school/hits involved, 
only the minimum and maximum are presented. In the case of a single 
nonresponding school, the value for that school is presented as the minimum. 

• The minimum, median, and maximum household incomes of the five-digit ZIP code 
area where the school is located. The data are calculated from 1980 Census data, 
but are updated to 1985 dollars. 

Examination of the table shows that invariably the respondents and nonrespondents are 
quite similar with regard to type of locale. There are great similarities in many cases for other 
characteristics also, but on some occasions the nonresponding schools have a somewhat lower 
median income distribution than the respondents, and occasionally also there is some difference 
in the distributions of minority enrollment levels. For example, for grade 4, in New York, Class 
3, the nonresponding schools have somewhat higher rates of Black and Hispanic enrollment and 
somewhat lower median household incomes than the respondents. For grade 8, in Minnesota, 
Class 1, the nonresponding schools have somewhat greater Hispanic enrollment and noticeably
lower median income than the respondents. By contrast, in Nebraska, Class 1, the 
nonresponding schools have somewhat lower Hispanic enrollment and noticeably higher median 
income than the respondents. 



7.4.3 Weighted Distributions of Students Before and After Student Absenteeism

Tables 7-7 and 7-8 show, for each state, the weighted sampled percentages of students by 
gender (male) and race/ethnicity (White, not Hispanic; Black, not Hispanic; Hispanic) for the 
full sample of students (after student exclusion) and for the assessed sample. Table 7-7 shows 
results for grade 4; Table 7-8 shows results for grade 8. 

The weight used for the full sample is the adjusted student base weight, defined in
section 7.3.3. The weight for the assessed students is the final student weight, also defined
in section 7.3.3. The difference between the two estimates of a subgroup's share of the population
is an estimate of the bias in estimating the size of the subgroup that results from student
absenteeism from the assessment. As such, it is an indicator of the potential for nonresponse
bias in the assessment results, resulting from student absenteeism.
Table 7-7

Weighted Student Percentages Derived from Sampled Schools, Grade 4

                     Weighted       Weighted Percentages Derived        Weighted Percentages Derived from Assessed
                     Student        from Full Sample                    Sample, with Student Nonresponse Adjustment
State                Participation  Male     White    Black    Hispanic Male     White    Black    Hispanic

Arizona              95%            51.25    56.17    4.10     28.48    51.35    55.96    4.17     28.70
Arkansas             96%            52.92    69.68    21.03    6.09     53.06    69.21    21.10    6.40
California           94%            51.14    44.78    6.50     35.34    51.61    44.77    6.43     34.97
Colorado             95%            4*78     67.79    5.33     21.48    49.93    67.33    5.22     21.55
Connecticut          96%            48.62    72.94    10.22    13.50    48.71    72.59    10.18    13.49
Delaware             95%            51.27    65.74    22.99    7.91     50.63    65.57    22.73    8.04
Dist. of Columbia    93%            48.12    5.30     82.42    9.16     47.75    5.34     82.10    9.52
Florida              95%            48.44    58.16    21.34    17.09    48.37    57.90    21.28    17.29
Georgia              95%            51.76    55.97    35.56    5.93     51.46    55.67    35.49    6.19
Guam                 95%            51.47    12.26    3.87     19.08    51.64    12.07    3.51     19.50
Hawaii               95%            49.85    20.51    4.14     17.18    49.24    20.70    4.27     17.90
Idaho                97%            49.59    84.14    0.57     11.22    49.18    83.56    0.58     11.45
Indiana              96%            49.79    81.99    10.35    5.19     50.30    82.08    9.95     5.38
Iowa                 96%            50.99    89.56    2.34     5.14     50.53    89.53    2.22     5.29
Kentucky             96%            49.58    84.92    8.74     4.04     49.47    84.80    8.63     4.29
Louisiana            95%            51.90    49.56    42.82    4.78     51.77    49.50    42.54    5.03
Maine                95%            49.41    91.21    0.61     4.92     49.02    90.93    0.59     5.75
Maryland             96%            49.18    58.77    30.13    5.79     49.63    58.52    30.01    5.96
Massachusetts        95%            50.65    78.77    7.76     7.96     50.53    79.15    7.46     8.04
Michigan             94%            52.04    73.43    13.95    8.27     51.71    73.34    3335     8.74
Minnesota            95%            50.23    85.79    2.64     6.90     50.26    85.44    2.59     7.29
Mississippi          97%            50.10    40.76    51.89    5.60     51.68    40.40    52.06    5.53
Missouri             96%            51.77    76.96    14.26    5.56     51.95    76.98    13.51    5.56
Nebraska             96%            50.44    83.56    6.73     6.70     50.76    84.12    5.96     7.04
New Hampshire        96%            50.11    89.24    1.22     4.99     50.28    88.92    1.24     5.16
New Jersey           96%            50.59    63.25    15.56    15.02    50.97    65.77    13.69    14.38
New Mexico           95%            47.09    44.53    3.52     46.27    46.71    43.93    3.60     46.65
New York             96%            51.61    61.41    12.50    20.18    51.65    59.31    12.57    21.69
North Carolina       95%            51.20    62.37    28.28    5.30     51.33    61.52    28.51    5.59
North Dakota         96%            52.69    91.29    0.45     3.60     52.93    91.13    0.48     3.52
Ohio                 95%            51.23    78.96    11.21    5.51     51.28    79.13    10.67    6.15
Oklahoma             84%            50.51    72.96    9.28     6.62     50.72    72.59    9.18     7.09
Pennsylvania         96%            52.92    76.55    12.65    7.49     52.95    76.94    12.19    7.44
Rhode Island         95%            51.43    77.50    6.31     10.52    51.05    77.77    6.27     10.56
South Carolina       97%            50.28    55.32    37.20    5.50     49.73    54.90    37.33    5.72
Tennessee            96%            52.28    69.78    22.57    5.02     52.49    69.06    23.04    5.25
Texas                96%            49.27    48.70    14.19    33.79    49.12    48.92    14.21    33.52
Utah                 96%            50.27    85.97    0.95     9.56     50.71    85.73    0.99     9.73
Virginia             95%            50.54    67.24    23.45    4.69     50.90    67.28    23.16    4.78
West Virginia        96%            49.03    90.40    2.54     4.52     48.52    90.18    2.52     4.73
Wisconsin            96%            51.36    8U1      6.22     7.16     51.47    80.91    6.18     7.32
Wyoming              96%            50.45    82.59    0.93     11.05    50.47    82.41    0.94     11.32
Table 7-8

Weighted Student Percentages Derived from Sampled Schools, Grade 8

                     Weighted       Weighted Percentages Derived        Weighted Percentages Derived from Assessed
                     Student        from Full Sample                    Sample, with Student Nonresponse Adjustment
State                Participation  Male     White    Black    Hispanic Male     White    Black    Hispanic

Arizona              93%            51.23    60.70    4.04     27.06    50.83    60.30    3.84     27.61
Arkansas             94%            50.54    72.68    21.35    3.80     50.76    72.35    21.39    4.22
California           92%            49.82    44.18    7.31     35.84    48.82    44.31    6.89     36.07
Colorado             93%            50.47    73.11    4.36     18.06    51.07    73.84    4.17     17.82
Connecticut          94%            50.22    72.13    12.05    12.50    50.09    72.39    11.81    12.26
Delaware             92%            50.41    65.32    25.32    5.68     50.12    64.88    25.47    6.23
Dist. of Columbia    85%            49.37    2.76     85.33    8.32     48.74    2.82     84.85    9.71
Florida              91%            49.01    56.33    23.02    17.36    48.87    55.31    22.81    18.31
Georgia              93%            48.36    59.18    34.72    2.27     48.03    58.63    34.86    4.04
Guam                 90%            52.88    4.85     1.25     13.35    52.13    4.30     1.35     15.19
Hawaii               90%            52.31    17.40    2.84     16.79    51.85    17.18    2.78     18.25
Idaho                95%            51.30    88.19    0.69     7.30     51.26    87.85    0.70     7.49
Indiana              94%            50.61    84.49    8.68     4.39     50.67    84.64    8.45     4.46
Iowa                 95%            52.49    92.33    1.87     3.41     52.34    92.34    1.79     3.36
Kentucky             96%            50.03    86.66    8.75     2.86     50.11    86.32    8.78     3.00
Louisiana            93%            46.57    53.95    39.60    4.18     47.17    53.81    39.31    4.37
Maine                92%            51.00    94.04    0.30     1.81     51.11    93.80    0.39     1.86
Maryland             93%            50.38    59.45    30.08    6.13     50.13    59.82    29.40    6.34
Massachusetts        94%            50.23    84.79    4.76     7.69     50.19    83.26    5.35     8.44
Michigan             94%            48.05    72.16    19.38    4.70     48.16    73.26    18.41    4.85
Minnesota            94%            49.78    91.33    1.48     3.34     49.39    91.44    1.36     3.49
Mississippi          95%            48.21    49.38    43.90    5.43     47.87    49.11    43.76    5.78
Missouri             95%            51.55    82.54    11.33    2.82     51.88    82.42    11.86    2.99
Nebraska             96%            53.02    86.72    4.78     5.22     52.67    86.34    4.93     5.32
New Hampshire        94%            50.12    91.82    0.86     2.76     50.37    91.41    0.83     2.85
New Jersey           94%            49.48    59.37    18.60    14.94    49.15    61.12    17.42    14.35
New Mexico           93%            49.79    43.27    2.36     49.16    49.74    43.74    2.38     48.34
New York             92%            48.88    66.22    14.33    12.35    49.12    61.42    17.38    14.29
North Carolina       94%            49.85    68.32    26.61    2.35     49.82    68.30    26.38    2.68
North Dakota         96%            50.57    92.90    0.44     2.34     51.07    93.32    0.41     2.64
Ohio                 93%            50.31    79.36    13.78    3.62     50.31    79.74    13.63    3.87
Oklahoma             80%            50.76    73.30    8.05     5.41     49.86    74.34    7.69     6.06
Pennsylvania         94%            50.28    83.37    10.37    3.21     50.39    83.08    10.68    3.26
Rhode Island         93%            49.77    81.10    5.81     7.85     49.85    80.38    5.86     8.15
South Carolina       94%            50.35    58.05    34.81    5.16     50.47    57.89    34.37    5.36
Tennessee            94%            493?     76.03    20.21    2.39     49.69    75.33    20.61    2.64
Texas                94%            49.09    48.31    12.06    35.66    49.24    47.81    11.95    34.05
Utah                 94%            51.65    89.34    0.60     6.36     51.36    89.30    0.62     '*39
Virginia             94%            50.61    69.22    22.00    4.37     50.28    68.86    21.81    4.63
Virgin Islands       92%            54.40    1.18     77.85    20.17    52.74    1.33     76.83    21.12
West Virginia        94%            48.94    91.19    4.27     2.38     48.78    90.80    4.36     2.55
Wisconsin            94%            50.37    85.62    6.87     4.34     50.33    85.85    6.80     4.39
Wyoming              95%            50.12    86.14    0.76     8.60     50.02    86.19    0.82     8.35
Care must be taken in interpreting these results, however. First, note that there is
generally very little difference in the proportions estimated from the full sample and those
estimated from the assessed students. While this is encouraging, it does not eliminate the
possibility that bias exists, either within the state as a whole, or for results for gender and
race/ethnicity subgroups, or for other subgroups. Second, where differences
do exist, they cannot be used to indicate the likely magnitude or direction of the bias with any
reliability. For example, at grade 4 in New Jersey, the percentages of Black and Hispanic
students in the full sample are respectively 15.57 and 15.21 percent. For assessed students, these
percentages are 13.69 for Black students and 14.39 for Hispanic students. While these
differences raise the possibility that some bias exists, it is not appropriate to speculate on the
magnitude of this bias by considering the assessment results for Black and Hispanic
students in comparison to other students in the state. This is because the underrepresented
Black and Hispanic students may not be typical of students that were included in the sample,
and similarly those students within the same racial/ethnic groups who are disproportionately
overrepresented may not be typical either, since not all students within the same
race/ethnicity group receive the same student nonresponse adjustment. Some insight as to the
kinds of students who are receiving relatively large adjustments, and the kinds of students that
they are being adjusted to represent, is given in the next section. Small sample sizes within
nonresponse adjustment classes make this information difficult to interpret, however. One other
feature to note is that, for assessed students, information as to the student's gender and
race/ethnicity is provided by the student, while for absent students this information is provided
by the school. Evidence from past NAEP assessments (see, for example, Rust & Johnson, 1992)
indicates that there can be substantial discrepancies between those two sources, especially with
regard to classifying students as Hispanic at grade 4.
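
The comparison behind Tables 7-7 and 7-8 can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the records, the field names (`base_w`, `final_w`, `assessed`, `male`), and the weights are hypothetical, not NAEP data. The full-sample percentage uses the adjusted student base weight for every sampled student; the assessed-sample percentage uses the final student weight for assessed students only.

```python
def weighted_percent(records, weight_key, flag_key):
    """Weighted percentage of records for which flag_key is true."""
    total = sum(r[weight_key] for r in records)
    if total == 0:
        raise ValueError("sum of weights is zero")
    flagged = sum(r[weight_key] for r in records if r[flag_key])
    return 100.0 * flagged / total


# Hypothetical sample: five students with equal adjusted base weights.
# One male student is absent, so the four assessed students carry a
# nonresponse adjustment of 50 / 40 = 1.25 in their final weights.
students = [
    {"base_w": 10.0, "final_w": 12.5, "assessed": True, "male": True},
    {"base_w": 10.0, "final_w": 12.5, "assessed": True, "male": False},
    {"base_w": 10.0, "final_w": 12.5, "assessed": True, "male": True},
    {"base_w": 10.0, "final_w": 12.5, "assessed": True, "male": False},
    {"base_w": 10.0, "final_w": 0.0, "assessed": False, "male": True},
]

full_pct = weighted_percent(students, "base_w", "male")
assessed_pct = weighted_percent(
    [s for s in students if s["assessed"]], "final_w", "male"
)
difference = assessed_pct - full_pct  # estimate of the bias in subgroup size
```

In this toy case the full sample is 60 percent male and the assessed sample 50 percent, so the difference of -10 points estimates how far absenteeism shifts the estimated size of the subgroup. It says nothing about the size or direction of any bias in assessment results.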



7.4.4 Characteristics of Absent Students 

Tables 7-9 and 7-10 show some characteristics of assessed (responding) and absent 
(nonresponding) students, by student nonresponse adjustment class, for classes with adjustment 
factors in excess of 1.25. Table 7-9 shows results for grade 4, Table 7-10 for grade 8.

In addition to information characterizing the class in terms of age class, monitor status, 
and type of location, the distributions of certain characteristics of assessed and absent students 
within each class are presented. The characteristics shown are: 

• The percentage of the state's public-school grade enrollment represented in the sample
by the students within the adjustment class. This is given by the sum of the
adjusted student base weights (W$, see section 7.3.3) for the responding and
nonresponding selected students respectively, within the student-level
nonresponse adjustment class. The student nonresponse adjustment factor is
calculated directly from these two quantities (one for respondents, one for
nonrespondents). The potential for nonresponse bias is generally greater in cases
where the size of the population represented by the nonrespondents is relatively
large.

• The percentage of students who are male, weighted by the base weight for each
student adjusted for school nonresponse. This estimates the proportion of students
who are male in the subpopulation represented by the sample students.

• The percentages of students who are White, Black, Hispanic, or of another
race/ethnicity. Again these percentages are weighted by the students' base
weights, adjusted for school nonresponse.

Table 7-9

Grade 4 Student Nonresponse Adjustment Classes with Adjustment Factors Greater than 1.25
Age class 1 consists of students born in September 1977 or earlier. All other students are in age class 2.

Table 7-10

Grade 8 Student Nonresponse Adjustment Classes with Adjustment Factors Greater than 1.25
Age class 1 consists of students born in September 1977 or earlier. All other students are in age class 2.

The tables show that assessed and absent students have similar characteristics within
nonresponse adjustment classes. A notable feature is that most of the cases involving
adjustment factors in excess of 1.25 occur within classes in which the students are in age class
1, that is, relatively old for their grade. Since both the respondents and nonrespondents share
this characteristic, this is not in itself a source of nonresponse bias. The potential for bias arises
because of the possibility that, within this group, the respondents differ from the
nonrespondents.

Note that invariably within a cell the size of the population represented by the
nonrespondents is relatively small. Thus it is not likely in any state that substantial nonresponse
bias could be arising from the nonresponse within a single cell. Rather, if such bias is occurring,
it must be aggregated across a number of cells having varying characteristics except perhaps for
the fact that they involve students of above-average age. The small number of nonrespondents
within each cell (often as few as five or six) makes it difficult to compare the characteristics of 
nonrespondents with those of respondents and to characterize the nonrespondents’ distributions 
of gender, race/ethnicity, and median household income. 

Of particular note in these tables is the presence of a large number of cells for 
Oklahoma with adjustment factors in excess of 1.25. This occurs because Oklahoma is the only
state that required written parental consent before a selected student could participate in the 
assessment. This requirement resulted in much greater student nonresponse overall than in 
other states. What the results in Tables 7-9 and 7-10 suggest is that this nonresponse is very 
widely distributed across the various adjustment classes, and is not concentrated among 
particular types of students. This lessens (but does not eliminate) the likelihood that the high 
levels of student nonresponse in Oklahoma have introduced substantial nonresponse bias. 



7.5 Variation in Weights

After completion of the weighting steps, an analysis was conducted of the distribution of 
the final student weights in each state. The analysis was intended to check that the various 
weight components had been derived properly in each state and to examine the impact of the 
variability of the sample weights on the precision of the sample estimates, both for the state as a 
whole and for major subgroups within the state. 

The analysis was conducted by looking at the distribution of the final student weights,
both for the approximately 2,500 assessed students in each state, grade, and subject, and for
subgroups defined by age, gender, race/ethnicity, level of urbanicity, and level of parents'
education. Two key aspects of the distribution were considered in each case: the coefficient of
variation (equivalently, the relative variance) of the weight distribution; and the presence of
outliers (i.e., cases whose weights were several standard deviations away from the median
weight).

It was important to examine the coefficient of variation of the weights because a large
coefficient of variation reduces the effective size of the sample. Assuming that the variables of
interest for individual students are uncorrelated with the weights of the students, the sampling
variance of an estimated average or aggregate is approximately

$$\left\{ 1 + \left( \frac{C}{100} \right)^{2} \right\}$$

times as great as the corresponding sampling variance based on a self-weighting sample of the
same size, where C is the coefficient of variation of the weights expressed as a percent. Outliers,
or cases with extreme weights, were examined because the presence of such an outlier was an
indication of the possibility that an error was made in the weighting procedure, and because it
was likely that a few extreme cases would contribute substantially to the size of the coefficient
of variation.



In most states, the coefficients of variation were 35 percent or less, both for the whole
sample and for all major subgroups. This means that the quantity $\{1 + (C/100)^{2}\}$ was generally
below 1.1, and the variation in sampling weights had little impact on the precision of sample
estimates.
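
As a rough illustration of the variance factor discussed above, the following Python sketch computes the coefficient of variation of a set of weights, the factor 1 + (C/100)^2, and the implied effective sample size. The function names and the toy weights are assumptions made for this example; the population form of the variance (dividing by n) is used so that the result agrees with the familiar (sum of weights)^2 / (sum of squared weights) expression.

```python
import math


def weight_cv_percent(weights):
    """Coefficient of variation of the weights, expressed as a percent."""
    n = len(weights)
    mean = sum(weights) / n
    variance = sum((w - mean) ** 2 for w in weights) / n  # population variance
    return 100.0 * math.sqrt(variance) / mean


def variance_inflation(cv_percent):
    """Approximate variance ratio relative to a self-weighting sample."""
    return 1.0 + (cv_percent / 100.0) ** 2


def effective_sample_size(weights):
    """Sample size divided by the variance inflation factor."""
    return len(weights) / variance_inflation(weight_cv_percent(weights))


toy_weights = [1.0, 1.0, 3.0, 3.0]          # hypothetical weights
cv = weight_cv_percent(toy_weights)          # 50 percent
factor = variance_inflation(cv)              # 1.25
n_eff = effective_sample_size(toy_weights)   # 3.2 of the 4 sampled cases
```

The approximation holds only under the assumption stated in the text, namely that the variables of interest are uncorrelated with the weights.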



Large student weights were observed in a few states. These extreme weights generally 
affected those students in schools for which the grade enrollment available at the time of sample 
selection proved to be several-fold short of the actual enrollment. An evaluation was made of 
the impact of trimming these largest weights back to a level consistent with the remaining large 
weights found in the state. Such a procedure produced some reduction in the size of the 
coefficient of variation. The reduction was sufficiently modest in each case, however, that we
judged that the potential for the introduction of bias through trimming, when combined with the
considerable effort required to implement an appropriate trimming procedure, was such that it
was preferable not to apply any trimming to the weights in these states. The analyses conducted
confirmed that weight components had been calculated and combined correctly, and it was
concluded that weight trimming should not be undertaken. Note, however, that weight trimming
of school base weights had already been applied in a few cases, prior to the analyses discussed
here (see section 122).



7.6 Calculation of Replicate Weights 

A method known as jackknife replication was used to estimate the sampling variance of 
statistics derived from the full sample. The process of replication involves repeatedly selecting 
portions of the sample to calculate the statistic of interest; the resultant estimates are known as 
replicate estimates. The variability among the calculated replicate estimates is then used to 
obtain the sampling variance of the full-sample estimate. The process of forming the replicate 
estimates is described below. 
7.6.1 Defining Replicate Groups for Variance Estimation

To form replicates for variance estimation, the sampled cluster/hits in each Cluster Type
2 or 3 state (that is, those states where not all schools were selected) were sorted by monitor
status, new-school status within monitor status, and finally by selection order within new-school
status. The selection order used to form the replicate groups reflected the implicit stratification
used in the selection of the sample of schools (see section 3.4.4). Within the sorted file, the
basic algorithm for forming the replicate groups was to pair successive cluster/hits, separately
within the two monitor status categories. A monitored cluster/hit was always paired with a
monitored cluster/hit, and an unmonitored cluster/hit was always paired with an unmonitored
cluster/hit. All members (schools) of a cluster/hit received the same pair code, and a substitute
school received the pair code of the school it replaced. Double-session substitute schools were
in effect assigned two pair codes, one corresponding to the original participating school and the
other corresponding to the refusing school for which the extra sessions were conducted.

Since the schools in the Cluster Type 1 states were certainty schools, they were sorted
and paired differently. First, each school was assigned a "half-group" code corresponding to the
expected number of students selected from the school. For the fourth-grade sample (not
including Guam and the Virgin Islands), the value of the half-group code was set to 1 if the
expected number of sample students in the school was less than 90; otherwise, the value of the
half-group code was set to 2. For the eighth-grade sample (not including Guam and the Virgin
Islands), the value of the half-group code was set to 1 if the expected number of sample students
in the school was less than 45; 2 if the expected number of sample students in the school was
between 45 and 74, inclusive; and 4 if the expected number of sample students in the school was
75 or greater. For schools in Guam, the values of the half-group code ranged from 2 to 8,
depending on the estimated grade enrollment of the school; for schools in the Virgin Islands, the
values of the half-group code ranged from 2 to 16, depending on the estimated grade enrollment
of the school. After assignment of the half-group codes, the schools within each Cluster Type 1
state were sorted by monitor status, half-group code (descending order) within monitor status,
and by the estimated grade enrollment of the school within half-group code. Note that the
half-group code essentially specifies the number of variance-estimation units to be created from
the school. For example, two clusters of students (i.e., variance-estimation units) were created
from each school having a half-group code of 2, four clusters of students were created from each
school having a half-group code of 4, and so on. Each variance-estimation unit was a systematic
sample of students within the school, and successive variance-estimation units in the sorted file
were paired to define the replicates.
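
The half-group rules above can be written as a small Python sketch. It covers only the fourth- and eighth-grade thresholds quoted in the text and does not attempt the Guam or Virgin Islands codes, whose exact cut points are not given here. The function names, the record fields, and the direction of the monitor-status and enrollment sorts are assumptions made for illustration.

```python
def half_group_code(grade, expected_students):
    """Half-group code for a Cluster Type 1 school (Guam and the Virgin Islands excluded)."""
    if grade == 4:
        return 1 if expected_students < 90 else 2
    if grade == 8:
        if expected_students < 45:
            return 1
        if expected_students <= 74:
            return 2
        return 4
    raise ValueError("only grades 4 and 8 are covered by this sketch")


def sort_type1_schools(schools, grade):
    """Sort by monitor status, then half-group code (descending), then enrollment."""
    return sorted(
        schools,
        key=lambda s: (
            s["monitored"],  # assumed order: unmonitored first
            -half_group_code(grade, s["expected_students"]),
            s["enrollment"],  # assumed ascending
        ),
    )


# Hypothetical eighth-grade schools.
schools = [
    {"id": "a", "monitored": False, "expected_students": 30, "enrollment": 40},
    {"id": "b", "monitored": False, "expected_students": 80, "enrollment": 120},
    {"id": "c", "monitored": True, "expected_students": 50, "enrollment": 60},
]
ordered_ids = [s["id"] for s in sort_type1_schools(schools, grade=8)]
```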

In some instances, there was an odd number of cluster/hits (in the case of Cluster Type
2 or 3 states) or variance-estimation units (in the case of Cluster Type 1 states) within a
monitor-status category. If this occurred, the last "pair" within the monitor-status category
actually consisted of three cluster/hits or variance-estimation units. In general, a single replicate
was defined by randomly dropping a member (i.e., either a cluster/hit or variance-estimation
unit) of a given pair and then reweighting the remaining sample elements to compensate for the
dropped unit. If the pair consisted of three units, two groups of two units each were randomly
retained to form two replicates.
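
A minimal Python sketch of the pairing step follows, assuming the units of one monitor-status category are already in sorted order. The function names and unit labels are hypothetical, and the random choices stand in for whatever randomization was actually used. Successive units are paired, a leftover unit joins the last pair to make a group of three, and each pair yields one replicate while a group of three yields two.

```python
import random


def pair_units(units):
    """Pair successive units; with an odd count the last group holds three units."""
    if len(units) < 2:
        raise ValueError("need at least two units to form a pair")
    groups = [list(units[i:i + 2]) for i in range(0, len(units) - 1, 2)]
    if len(units) % 2 == 1:
        groups[-1].append(units[-1])
    return groups


def form_replicates(groups, rng):
    """Return (group, dropped_unit) tuples, one per replicate."""
    replicates = []
    for group in groups:
        if len(group) == 2:
            replicates.append((group, rng.choice(group)))
        else:
            # Group of three: two replicates, each dropping a different unit,
            # so that each replicate retains a group of two units.
            for dropped in rng.sample(group, 2):
                replicates.append((group, dropped))
    return replicates


units = ["A", "B", "C", "D", "E"]  # hypothetical sorted units in one category
groups = pair_units(units)
replicates = form_replicates(groups, random.Random(0))
```

The pairing would be applied separately within the monitored and unmonitored categories, so that no group mixes the two.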

The number of replicates formed in this manner depended on the number of pairs
formed. Based on statistical and computer processing requirements, it was decided that 56
replicates would be sufficient for the variance calculations. In a few states, there were more
than 56 initial pairs using the procedures described above. In these states, it was necessary to
combine some of the initial replicate groups to reduce the total number of replicates. In
general, the goal was to combine an initial pair with another pair consisting of dissimilar schools
within the same monitor-status category.

In some states, fewer than 56 replicates were formed. In order to provide a uniform 
total of 56 replicates, additional sets of replicate weights were created simply by setting the 
additional sets equal to the set of full-sample weights. This procedure is unbiased and produces 
appropriate jackknifed sampling errors, while giving uniformity across states in the number of 
replicate weights. 



7.6.2 School-level Replicate Weights 



As mentioned above, each replicate sample had to be reweighted to compensate for the
dropped unit(s) defining the replicate. For the Cluster Type 2 and 3 states, this reweighting was
done in two stages. At the first stage, the ith school/hit included in a particular replicate r was
assigned a replicate-specific school/hit base weight defined as follows:

$$W_{(r)i}^{\mathrm{base}} = K_r \cdot W_i^{\mathrm{base}}$$

where $W_i^{\mathrm{base}}$ is the full-sample base weight for school/hit i, and $K_r$ equals

1.5 if school/hit i was contained in a "pair" consisting of 3 units from which
the complementary member was dropped to form replicate r,

2 if school/hit i was contained in a pair consisting of 2 units from which the
complementary member was dropped to form replicate r,

0 if school/hit i was dropped to form replicate r,

1 otherwise.
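
The four cases for the replicate factor translate directly into code. In the sketch below, the function names, unit labels, and weights are hypothetical; the factor values are the ones listed above. A useful check is that, within the affected group, the weight removed from the dropped unit is redistributed to the retained units when the units have equal base weights.

```python
def replicate_factor(unit, group, dropped):
    """Factor applied to a unit's base weight for the replicate that drops `dropped` from `group`."""
    if unit not in group:
        return 1.0   # unit is not in the group defining this replicate
    if unit == dropped:
        return 0.0   # the dropped unit
    return 1.5 if len(group) == 3 else 2.0


def replicate_base_weights(base_weights, group, dropped):
    """Replicate-specific base weights for every unit in the sample."""
    return {
        unit: replicate_factor(unit, group, dropped) * weight
        for unit, weight in base_weights.items()
    }


base_weights = {"A": 10.0, "B": 10.0, "C": 10.0, "D": 10.0, "E": 10.0}
rep_weights = replicate_base_weights(base_weights, ["C", "D", "E"], "D")
```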



Using the replicate-specific school/hit base weights, the school-level nonresponse
weighting adjustments as described in section 7.3.3 were recalculated for each replicate r. That
is, the school-level nonresponse adjustment factor for schools in replicate r and adjustment class
h was computed as:

$$F_{(r)h}^{(1)} = \frac{\sum_{i \in C_h} W_{(r)hi}^{\mathrm{base}} \, E_{hi}}{\sum_{i \in C_h} W_{(r)hi}^{\mathrm{base}} \, E_{hi} \, \delta_{(r)hi}}$$

where

$C_h$ = the subset of school/hit records in adjustment class h;

$W_{(r)hi}^{\mathrm{base}}$ = the replicate-r base weight of the ith school/hit in class h;

$E_{hi}$ = the QED grade enrollment for the ith school/hit in class h; and

$\delta_{(r)hi}$ = 1 if the ith school/hit in replicate r and adjustment class h
participated in the assessments, and 0 otherwise.

The replicate-specific nonresponse-adjusted school/hit weight for the ith school in class h
in replicate r was then computed as:

$$W_{(r)hi}^{\mathrm{adj}} = F_{(r)h}^{(1)} \cdot W_{(r)hi}^{\mathrm{base}}$$
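
The adjustment within one class can be sketched as follows, under the assumption that the factor is the ratio of the weighted enrollment of all school/hits in the class to the weighted enrollment of the participating ones, as the definitions above suggest. The record fields and the numbers are hypothetical.

```python
def school_nonresponse_factor(records):
    """Weighted enrollment of all school/hits divided by that of participants."""
    total = sum(r["weight"] * r["enrollment"] for r in records)
    responding = sum(
        r["weight"] * r["enrollment"] for r in records if r["participated"]
    )
    if responding == 0:
        raise ValueError("no participating school/hits in this adjustment class")
    return total / responding


# Hypothetical adjustment class for one replicate.
adjustment_class = [
    {"weight": 2.0, "enrollment": 100, "participated": True},
    {"weight": 2.0, "enrollment": 50, "participated": False},
    {"weight": 1.0, "enrollment": 150, "participated": True},
]
factor = school_nonresponse_factor(adjustment_class)

# Adjusted weights apply the factor to participating school/hits only.
adjusted = [
    r["weight"] * factor for r in adjustment_class if r["participated"]
]
```

After adjustment, the participating school/hits represent the same weighted enrollment as the whole class did before adjustment (450 in this toy case).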
7.6.3 Student-level Replicate Weights

For the Cluster Type 2 and 3 states, replicate-specific adjusted student base weights were
calculated by multiplying the replicate-specific adjusted school/hit weights as described above by
the corresponding within-school student weights. That is, following the procedures in section
7.3.5, the adjusted student base weight for the jth student in adjustment class k in replicate r was
initially computed as:

$$W_{(r)ijk} = W_{(r)hi}^{\mathrm{adj}} \cdot W_{ij}^{\mathrm{within}}$$

where

$W_{(r)hi}^{\mathrm{adj}}$ = the nonresponse-adjusted school/hit weight for school/hit i in school
adjustment class h and replicate r, and

$W_{ij}^{\mathrm{within}}$ = the within-school weight for the jth student in school i.

For the Cluster Type 1 states, the school-level nonresponse adjustment was not
replicated since the schools in such states were selected with certainty. In this case, the
replicate-specific adjusted student base weight for the jth student in adjustment class k in
replicate r was calculated as:

$$W_{(r)ijk} = W_{hi}^{\mathrm{adj}} \cdot W_{(r)ij}^{\mathrm{within}}$$

where

$W_{hi}^{\mathrm{adj}}$ = the overall nonresponse-adjusted school/hit weight for school/hit i in
school adjustment class h, and

$W_{(r)ij}^{\mathrm{within}} = K_r \cdot W_{ij}^{\mathrm{within}}$ = the replicate-specific within-school weight for the jth student in school i.

The factor $K_r$ in the above expression for the replicate-specific within-school weight
compensates for the units dropped out in any given replicate (see section 7.6.1) and is defined
by:

2 for the students in school i who were in a pair from which the
complementary variance-estimation unit was dropped to form replicate r,

0 for the students in the variance-estimation unit that was dropped to form
replicate r,

1 otherwise.

The final replicate-specific student weights were then obtained by applying the student
nonresponse adjustment procedures (see section 7.3.5) to each set of replicate student weights.
Let $F_{(r)k}^{(2)}$ denote the student-level nonresponse adjustment factor for replicate r and adjustment
class k. For the Cluster Type 2 and 3 states, the final replicate-r student weight for student j in
school i in adjustment class k was calculated as:

$$W_{(r)ijk}^{\mathrm{final}} = F_{(r)k}^{(2)} \cdot W_{(r)hi}^{\mathrm{adj}} \cdot W_{ij}^{\mathrm{within}}$$

For the Cluster Type 1 states, the corresponding final replicate-r student weight for student j in
school i in adjustment class k was calculated as:

$$W_{(r)ijk}^{\mathrm{final}} = F_{(r)k}^{(2)} \cdot W_{hi}^{\mathrm{adj}} \cdot W_{(r)ij}^{\mathrm{within}}$$

Estimates of the variance of sample-based estimates were calculated as follows:

Let $\hat{t} = \sum_{i,j} W_{ijk}^{\mathrm{final}} \, x_{ijk}$ denote an estimated total based on the full sample, and let
$\hat{t}_{(r)}$ denote the corresponding estimate based on replicate r. The jackknife variance estimate
of $\hat{t}$ was calculated as:

$$\mathrm{var}_{JK}(\hat{t}) = \sum_{r=1}^{R} \left( \hat{t}_{(r)} - \hat{t} \right)^{2}$$

where R is the number of replicates.
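
The variance calculation can be illustrated with a short Python sketch. The values, weights, and function names are hypothetical. The replicate estimate is recomputed with each set of replicate weights, and the squared deviations from the full-sample estimate are summed. A replicate whose weights equal the full-sample weights, as used to pad states with fewer than 56 replicates, contributes zero to the sum.

```python
def weighted_total(values, weights):
    """Weighted total of the values."""
    return sum(w * x for w, x in zip(weights, values))


def jackknife_variance(values, full_weights, replicate_weights):
    """Sum of squared deviations of replicate estimates from the full-sample estimate."""
    t_full = weighted_total(values, full_weights)
    return sum(
        (weighted_total(values, w_r) - t_full) ** 2 for w_r in replicate_weights
    )


values = [1.0, 2.0, 3.0, 4.0]            # hypothetical student values
full_weights = [10.0, 10.0, 10.0, 10.0]  # full-sample weights
replicate_weights = [
    [0.0, 20.0, 10.0, 10.0],   # first pair: unit 1 dropped, unit 2 doubled
    [10.0, 10.0, 20.0, 0.0],   # second pair: unit 4 dropped, unit 3 doubled
    [10.0, 10.0, 10.0, 10.0],  # padding replicate equal to the full sample
]
variance = jackknife_variance(values, full_weights, replicate_weights)
```

Here the full-sample total is 100, the replicate totals are 110, 90, and 100, and the variance estimate is 200.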
7.7 Calculation of School Weights

Since schools in the Cluster Type 1 states were selected with certainty, the "school/hit"
weights described in section 7.3.3 can be used to estimate school-level characteristics and 
aggregates. However, these school/hit weights are not appropriate for the Cluster Type 2 and 3 
states because large schools had a chance of being selected more than once in the sampling 
process. To compensate for the possibility of multiple selections, schools in the Cluster Type 2 
and 3 states were assigned school weights, equal to: 

It 1 



where Wj? is the adjusted school/hit weight for school/hit i in adjustment class h t and where the 
sum extends over the school/hits corresponding to school s. Similarly, the replicate-specific 
school weights were computed as: 




where W^, is the replicate-specific adjusted school/hit weight defined in section 7.6.2. 
Chapter 8 



THEORETICAL BACKGROUND AND PHILOSOPHY OF 
NAEP SCALING PROCEDURES 



Eugene G. Johnson, Robert J. Mislevy, and Neal Thomas 
Educational Testing Service 



8.1 OVERVIEW 

The primary method by which results from the Trial State Assessment are disseminated 
is scale-score reporting. With scaling methods, the performance of a sample of students in a 
subject area or subarea can be summarized on a single scale or series of subscales even when 
different students have been administered different items. This chapter presents an overview of 
the scaling methodologies employed in the analyses of the data from NAEP surveys in general 
and from the Trial State Assessment of mathematics in particular. Details of the scaling 
procedures specific to the Trial State Assessment are presented in Chapter 9. 



8.2 BACKGROUND 

The basic information from an assessment consists of the responses of students to the 
items presented in the assessment. For NAEP, these items are generated to measure 
performance on sets of objectives developed by nationally representative panels of learning area 
specialists, educators, and concerned citizens. Satisfying the objectives of the assessment and 
ensuring that the tasks selected to measure each goal cover a range of difficulty levels typically 
requires a large number of items. The Trial State Assessment of mathematics required 175 
items at grade 4 and 205 items at grade 8. To reduce student burden, each assessed student was 
presented only a fraction of the full pool of items using multiple matrix sampling procedures. 

The most direct manner of presenting the assessment results is to report percent correct 
statistics for each item. However, because of the vast amount of information, separate results 
for each of the items in the assessment pool hinder the comparison of the general performance 
of subgroups of the population. Item-by-item reporting ignores overarching similarities in trends 
and subgroup comparisons that are common across items. 

It is useful to view the assessed items as random representatives of a conceptually 
infinite pool of items within the same domain and of the same type. In this random item 
concept, a set of items is taken to represent the domain of interest. An obvious measure of 
achievement within a domain of interest is the average percent correct across all presented 
items within that domain. The advantage of averaging is that it tends to cancel out the effects 
of peculiarities in items that can affect item difficulty in unpredictable ways. Furthermore, 
averaging makes it possible to compare more easily the general performances of subpopulations. 

Despite their advantages, there are a number of significant problems with average 
percent correct scores. First, the interpretation of these results depends on the selection of the 
items; the selection of easy or difficult items could make student performance appear to be 
overly high or low. Second, the average percent correct metric is related to the particular items 
comprising the average, so that direct comparisons in performance between subpopulations 
require that those subpopulations have been administered the same set of items. Third, because 
this approach limits comparisons to percents correct on specific sets of items, it provides no 
simple way to report trends over time when the item pool changes. Finally, direct estimates of 
statistics such as the proportion of students who would respond correctly to 80 percent of the 
items in the pool are not possible when every student is administered only a fraction of the item 
pool. While the mean percent correct across all items in the pool can be readily obtained (as 
the average of the individual item percent correct statistics), distributional statistics, such as 
quantiles of the distribution of scores across the full set of items, cannot be readily obtained 
without additional assumptions. 

These limitations can be overcome by the use of response scaling methods. If several 
items require similar skills, the regularities observed in response patterns can often be exploited 
to characterize both respondents and items in terms of a relatively small number of variables. 
These variables include a respondent-specific variable, called proficiency, which quantifies a 
respondent's tendency to answer items correctly, and item-specific variables, which indicate 
characteristics of the item such as its difficulty, ability to distinguish between individuals with 
different levels of proficiency, and the chances of a very low proficiency respondent correctly 
answering the item. (These variables are discussed in more detail in the next section.) When 
combined through appropriate mathematical formulas, these variables capture the dominant 
features of the data. Furthermore, all students can be placed on a common scale, even though 
none of the respondents take all of the items within the pool. Using the scale, it becomes 
possible to discuss distributions of proficiency in a population or subpopulation and to estimate 
the relationships between proficiency and background variables. 

It is important to point out that any procedure of aggregation, from a simple average to 
a complex multidimensional scaling model, highlights certain patterns at the expense of other 
potentially interesting patterns that may reside within the data. Every item in a NAEP survey is 
of interest and can provide useful information about what young Americans know and can do. 
The choice of an aggregation procedure must be driven by a conception of just which patterns 
are salient for a particular purpose. 

The scaling for the Trial State Assessment was carried out separately within the five 
mathematics content areas specified in the framework and for items designed to measure skills 
in estimation. This scaling within subareas was done because it was anticipated that different 
patterns of performance might exist for these essential subdivisions of the subject area. Each 
content area scale corresponded to one of five content areas: Numbers and Operations; 
Measurement; Geometry; Data Analysis, Statistics, and Probability; and Algebra and Functions. 
By creating a separate scale for each of these content areas, potential differences in 
subpopulation performance between the content areas are maintained. The separate estimation 
scale was created from an additional, special set of items measuring estimation skills (see 
section 2.6). The separate estimation scale allows for the measurement of potential 
performance differences within that skill area relative to performance on the other scales. 
Analyses of the results for the separate scales from the 1992 Trial State Assessment and 
national mathematics assessment have shown that the separate scales provide additional 
information that a single scale cannot — for example, gender differences in mathematics 
performance by type of scale. 

The creation of a series of separate scales to describe mathematics performance does 
not preclude the reporting of an overall mathematics composite as a single index of overall 
mathematics performance. A composite is computed as the weighted average of the five content 
area scales, where the weights correspond to the relative importance given to each content area 
as defined by the objectives. The composite provides a global measure of performance within 
the subject area, while the constituent content area scales allow the measurement of important 
interactions within educationally relevant subdivisions of the subject area. 
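
The composite described above is a weighted average of the content area scale scores. The sketch below illustrates that arithmetic only. The function name is hypothetical, and the weights and scores in the example are placeholders, since the actual weights are set by the assessment objectives and are not given in this section.

```python
def composite_score(subscale_scores, weights):
    """Weighted average of content area scale scores; weights are normalized to sum to 1."""
    if set(subscale_scores) != set(weights):
        raise ValueError("scores and weights must cover the same content areas")
    total_weight = sum(weights.values())
    return sum(weights[area] * subscale_scores[area] for area in weights) / total_weight


# Placeholder values for illustration only; these are not the NAEP weights or results.
scores = {"Numbers and Operations": 260.0, "Measurement": 250.0}
weights = {"Numbers and Operations": 0.6, "Measurement": 0.4}
print(composite_score(scores, weights))
```

Normalizing by the sum of the weights keeps the result on the same scale as the inputs even if the supplied weights do not sum exactly to one.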



8.3 SCALING METHODOLOGY 

This section reviews the scaling models employed in the analyses of data from the Trial 
State Assessment of mathematics and the 1992 national mathematics assessment, and the 
"plausible values" methodology that allows such models to be used with NAEP’s sparse 
item-sampling design. The reader is referred to Mislevy (1991) for an introduction to plausible 
values methods and a comparison with standard psychometric analyses, to Mislevy, Johnson and 
Muraki (1992) and Beaton and Johnson (1992) for additional information on how the models 
are used in NAEP, and to Rubin (1987) for the theoretical underpinnings of the approach. 

While the NAEP procedures were developed explicitly to handle the characteristics of 
NAEP data, they build on other research, and are paralleled by other researchers. See, for 
example: Dempster, Laird, and Rubin (1977); Little and Rubin (1983, 1987); Andersen (1980); 
Engelen (1987); Hoijtink (1991); Laird (1978); Lindsey, Clogg, and Grego (1991); Zwinderman 
(1991); Tanner and Wong (1987); and Rubin (1991). 

The 175 mathematics items administered at grade 4 and the 205 items administered at 
grade 8 in the Trial State Assessment were also administered to students of the same grades in 
the national mathematics assessment. However, because the administration procedures differed, 
the Trial State Assessment data were scaled independently from the national data. The national 
data also included results for students in grade 12. Details of the scaling of the Trial State 
Assessment and the subsequent linking to the results from the national mathematics assessment 
are provided in Chapter 9. 



8.3.1 The Scaling Models 

Three distinct scaling models were used in the analysis of the data from the Trial State 
Assessment. Each of the models is based on item response theory (IRT; e.g., Lord, 1980). 
Each is a "latent variable" model, defined separately for each of the scales, and quantifying 
respondents’ tendencies to provide correct answers to the items contributing to a scale as a 
function of a parameter that is not directly observed, called proficiency on the scale. 
A three-parameter logistic (3PL) model was used for the multiple-choice items. The 
fundamental equation of the 3PL model is the probability that a person whose proficiency on 
scale k is characterized by the unobservable variable $\theta_k$ will respond correctly to item j: 

$$P(x_j = 1 \mid \theta_k, a_j, b_j, c_j) = c_j + \frac{1 - c_j}{1 + \exp[-1.7 a_j (\theta_k - b_j)]} \equiv P_{j1}(\theta_k) \qquad (8.1)$$

where 

$x_j$ is the response to item j, 1 if correct and 0 if not; 

$a_j$, where $a_j > 0$, is the slope parameter of item j, characterizing its sensitivity 
to proficiency; 

$b_j$ is the threshold parameter of item j, characterizing its difficulty; and 

$c_j$, where $0 \le c_j < 1$, is the lower asymptote parameter of item j, reflecting the 
chances of students of very low proficiency selecting the correct option. 

Further define the probability of an incorrect response to the item as 

$$P_{j0}(\theta_k) \equiv P(x_j = 0 \mid \theta_k, a_j, b_j, c_j) = 1 - P_{j1}(\theta_k) \qquad (8.2)$$

A two-parameter logistic (2PL) model was used for short constructed-response items, 
which were scored correct or incorrect. The form of the 2PL model is the same as equations 
(8.1) and (8.2) with the $c_j$ parameter fixed at zero. 
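
The 3PL response function, and the 2PL model as its special case with the lower asymptote fixed at zero, can be sketched as below. This is an illustrative implementation of the logistic form with the 1.7 scaling constant described above, not the NAEP production code. The function name and the parameter values in the example are hypothetical.

```python
import math


def p_correct(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model; c = 0 gives the 2PL model."""
    if a <= 0 or not (0.0 <= c < 1.0):
        raise ValueError("require a > 0 and 0 <= c < 1")
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def p_incorrect(theta, a, b, c=0.0):
    """Probability of an incorrect response: the complement of p_correct."""
    return 1.0 - p_correct(theta, a, b, c)


# Hypothetical item parameters, for illustration only.
print(p_correct(theta=0.0, a=1.2, b=0.5, c=0.2))
```

When proficiency equals the threshold parameter, the probability of a correct response is halfway between the lower asymptote and one, which is a convenient sanity check on any implementation.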

In addition to the multiple-choice and short constructed-response items, a number of 
extended constructed-response items (5 at grade 4 and 6 at grade 8) were presented in the Trial 
State and national assessment. Each of these items was scored on a multipoint scale with 
potential scores ranging from 0 to 4. Additionally, as discussed in Chapter 9, certain sets of 
items consisting of highly correlated parts were combined into "testlets" (Wainer & Kiely, 1987) 
where the score assigned to a testlet was the number of constituent parts answered correctly. 
Items which are scored on a multipoint scale are referred to as polytomous items, in contrast 
with the multiple-choice and short constructed-response items, which are scored 
correct/incorrect and referred to as dichotomous items. 

The polytomous items were scaled using a generalized partial credit model (Muraki, 
1992). The fundamental equation of this model is the probability that a person with proficiency 
$\theta_k$ on scale k will have, for the jth polytomous item, a response $x_j$ that is scored in the 
ith of $m_j$ ordered score categories: 

$$P(x_j = i \mid \theta_k, a_j, b_j, d_{j1}, \ldots, d_{j,m_j-1}) = \frac{\exp\left[\sum_{v=0}^{i} 1.7 a_j (\theta_k - b_j + d_{jv})\right]}{\sum_{g=0}^{m_j-1} \exp\left[\sum_{v=0}^{g} 1.7 a_j (\theta_k - b_j + d_{jv})\right]} \equiv P_{ji}(\theta_k) \qquad (8.3)$$

where 

$m_j$ is the number of categories in the response to item j 

$x_j$ is the response to item j, with possibilities $0, 1, \ldots, m_j - 1$ 

$a_j$ is the slope parameter 

$b_j$ is the item location parameter characterizing overall difficulty 

$d_{ji}$ is the category i threshold parameter (see below). 

Indeterminacies in the parameters of the above model are resolved by setting $d_{j0} = 0$ and 
setting $\sum_{i=1}^{m_j - 1} d_{ji} = 0$. Muraki (1992) points out that $b_j - d_{ji}$ is the point on 
the $\theta_k$ scale at which the plots of $P_{j,i-1}(\theta_k)$ and $P_{ji}(\theta_k)$ intersect and so 
characterizes the point on the $\theta_k$ scale at which the response to item j has the highest 
probability of incurring a change from response category i-1 to i. 



When $m_j = 2$, so that there are two score categories (0, 1), it can be shown that 
$P_{ji}(\theta_k)$ of equation 8.3 for i = 0, 1 corresponds respectively to $P_{j0}(\theta_k)$ and 
$P_{j1}(\theta_k)$ of the 2PL model (equations 8.1 and 8.2 with $c_j = 0$). 
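
The reduction to the 2PL model can be checked numerically. The sketch below assumes the usual form of the generalized partial credit model, in which the probability of category i is proportional to the exponential of the cumulative sum of $1.7 a_j (\theta_k - b_j + d_{jv})$ for v = 0, ..., i, with $d_{j0} = 0$. The function name and the parameter values are hypothetical.

```python
import math


def gpcm_category_probs(theta, a, b, d):
    """Category probabilities under an assumed generalized partial credit form.

    d is the list of category threshold parameters [d_0, d_1, ..., d_{m-1}] with d_0 = 0.
    """
    exponents = []
    running = 0.0
    for d_v in d:
        running += 1.7 * a * (theta - b + d_v)   # cumulative sum over v = 0..i
        exponents.append(running)
    largest = max(exponents)                      # subtract the maximum for numerical stability
    terms = [math.exp(z - largest) for z in exponents]
    total = sum(terms)
    return [term / total for term in terms]


# With two categories and d = [0, 0], the second probability should match the 2PL model.
theta, a, b = 0.8, 1.1, 0.2
probs = gpcm_category_probs(theta, a, b, [0.0, 0.0])
two_pl = 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))
print(probs[1], two_pl)
```

The v = 0 term is common to every category and cancels in the ratio, so including it does not change the probabilities.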

A typical assumption of item response theory is the conditional independence of the 
probabilities of correct response by an individual to a set of items, given the individual's 
proficiency. That is, conditional on the individual's $\theta_k$, the joint probability of a particular 
response pattern $x = (x_1, \ldots, x_n)$ across a set of n items is simply the product of terms based on 
(8.1), (8.2), and (8.3): 

$$P(x \mid \theta_k, \text{item parameters}) = \prod_{j=1}^{n} \prod_{i=0}^{m_j - 1} P_{ji}(\theta_k)^{u_{ji}} \qquad (8.4)$$

where $P_{ji}(\theta_k)$ is of the form appropriate to the type of item (dichotomous or polytomous), 
$m_j$ is taken equal to 2 for the dichotomously scored items, and $u_{ji}$ is an indicator variable 
defined by 

$$u_{ji} = \begin{cases} 1 & \text{if response } x_j \text{ was in category } i \\ 0 & \text{otherwise.} \end{cases}$$
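
Because the indicator variable selects exactly one category per item, the product in equation 8.4 reduces to multiplying, over items, the probability of the category actually observed. A minimal sketch follows. The function name is hypothetical, and skipping items coded as `None` (items not presented to the respondent) is an assumption of this example.

```python
def pattern_likelihood(category_probs, responses):
    """Likelihood of a response pattern given per-item category probabilities.

    category_probs[j][i] is the probability of category i on item j at a fixed proficiency;
    responses[j] is the observed category for item j, or None if the item was not presented.
    """
    likelihood = 1.0
    for probs, x in zip(category_probs, responses):
        if x is None:
            continue              # an item that was not presented contributes no factor
        likelihood *= probs[x]    # the indicator picks out the observed category
    return likelihood


# Hypothetical probabilities: one dichotomous item and one three-category item.
print(pattern_likelihood([[0.3, 0.7], [0.5, 0.25, 0.25]], [1, 2]))
```

Evaluating this function over a grid of proficiency values, with the category probabilities recomputed at each grid point, traces out the likelihood function for a respondent.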

It is also typically assumed that response probabilities are conditionally independent of 
background variables (y), given $\theta_k$, or 

$$P(x \mid \theta_k, \text{item parameters}, y) = P(x \mid \theta_k, \text{item parameters}) \qquad (8.5)$$

After x has been observed, equation 8.4 can be viewed as a likelihood function, and 
provides a basis for inference about $\theta_k$ or about item parameters. Estimates of item parameters 
were obtained by the NAEP BILOG/PARSCALE program, which combines Mislevy and Bock’s 
(1982) BILOG and Muraki and Bock’s (1991) PARSCALE computer programs, and which 
concurrently estimates parameters for all items (dichotomous and polytomous). The item 
parameters are then treated as known in subsequent calculations. The parameters of the items 
constituting each of the separate scales were estimated independently of the parameters of the 
other scales. Once items have been calibrated in this manner, a likelihood function for the scale 
proficiency $\theta_k$ is induced by a vector of responses to any subset of calibrated items, thus allowing 
$\theta_k$-based inferences from matrix samples. 

Item parameter estimation was performed separately for the grade 4 and the grade 8 
data. As stated previously, item parameter estimation was performed independently for the 
Trial State Assessment and for the national mathematics assessment. In both cases, the 
identical scale definitions were used. 

In all NAEP IRT analyses, missing responses at the end of each block a student was 
administered were considered "not-reached," and treated as if they had not been presented to 
the respondent. Missing responses to dichotomous items before the last observed response in a 
block were considered intentional omissions, and treated as fractionally correct at the value of 
the reciprocal of the number of response alternatives. These conventions are discussed by 
Mislevy and Wu (1988). With regard to the handling of not-reached items, Mislevy and Wu 
found that ignoring not-reached items introduces slight biases into item parameter estimation to 
the degree that not-reached items are present and speed is correlated with ability. With regard 
to omissions, they found that the method described above provides consistent limited- 
information likelihood estimates of item and ability parameters under the assumption that 
respondents omit only if they can do no better than responding randomly. 
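
The conventions for missing responses to dichotomous items can be sketched as a recoding step applied to each block. The function name and the use of `None` for a missing response are hypothetical choices for this illustration, not NAEP's data format.

```python
def score_block(responses, n_alternatives):
    """Recode one block of dichotomous item responses, in presentation order.

    responses: list of 1 (correct), 0 (incorrect), or None (missing).
    n_alternatives: number of response alternatives for each item.
    Returns scores in which an omitted item is fractionally correct (1 / alternatives)
    and a not-reached item stays None (treated as not presented).
    """
    observed = [i for i, r in enumerate(responses) if r is not None]
    last_observed = observed[-1] if observed else -1
    scored = []
    for i, r in enumerate(responses):
        if r is not None:
            scored.append(float(r))
        elif i < last_observed:
            scored.append(1.0 / n_alternatives[i])   # omission before the last observed response
        else:
            scored.append(None)                       # missing at the end of the block: not reached
    return scored


# Hypothetical block: item 2 omitted, items 4 and 5 not reached.
print(score_block([1, None, 0, None, None], [4, 5, 4, 4, 4]))
```

This sketch covers dichotomous items only. The separate rule for extended constructed-response items at the end of a block is not implemented here.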

Because the extended constructed-response items were always the last item in a block 
and because considerably more effort was required of the student to answer these items, 
nonresponse to an extended constructed-response item was considered an intentional omission 
(and scored as the lowest category, 0) unless the student also did not respond to the item 
immediately preceding that item. In that case, the extended constructed-response item was 
considered not reached and treated as if it had not been presented to the student. 

Although the IRT models are employed in NAEP only to summarize performance, a 
number of checks are made to detect serious violations of the assumptions underlying the 
models (such as conditional independence), and, when warranted, remedial efforts are made to 
mitigate the effects of such violations on inferences. These checks include comparisons of 
empirical and theoretical item response functions to identify items for which the IRT model may 
provide a poor fit to the data. 

Scaling areas in NAEP are determined a priori by considerations of content as 
collections of items for which overall performance is deemed to be of interest, as defined by the 
frameworks developed by the National Assessment Governing Board. A proficiency scale $\theta_k$ is 
defined a priori by the collection of items representing that scale. What is important, therefore, 
is that the models capture salient variation in the response data to effectively summarize the 
overall performance on the content area of the populations and subpopulations being assessed. 
Because of the a priori definition of the latent proficiency variable, departure from conditional 
independence tends to cancel out over items and does not seriously affect the estimation of 
whole group and subpopulation distributions, except when substantial differential item 
functioning (DIF) is found simultaneously for many items. NAEP has routinely conducted DIF 
analyses to guard against potential biases in making subpopulation comparisons based on the 
proficiency distributions. 

The local independence assumption embodied in equation 8.4 implies that item response 
probabilities depend only on $\theta$ and the specified item parameters, and not on the position of the 
item in the booklet, the content of items around an item of interest, or the test-administration 
timing conditions. However, these effects are certainly present in any application. The practical 
question is whether inferences based on the IRT probabilities obtained via 8.4 are robust with 
respect to the ideal assumptions underlying the IRT model. Our experience with the 1986 
NAEP reading anomaly has shown that for measuring small changes over time, changes in item 
context and speededness conditions can lead to unacceptably large random error components. 
These can be avoided by presenting items used to measure change in identical test forms, with 
identical timings and administration conditions. Thus, we do not maintain that the item 
parameter estimates obtained in any particular booklet configuration are appropriate for other 
conceivable configurations. Rather, we assume that the parameter estimates are context-bound. 
(For this reason, we prefer common population equating to common item equating whenever 
equivalent random samples are available for linking.) This is the reason that the data from the 
Trial State Assessment were calibrated separately from the data from the national 
NAEP — since the administration procedures differed somewhat between the Trial State 
Assessment and the national NAEP, the values of the item parameters could be different. 
Furthermore, to allow for the possibility that item parameters could change over time, the 1992 
grade 8 calibration was conducted separately from the data from the 1990 grade 8 Trial State 
Assessment. Chapter 9 provides details on the procedures used to link the results of the 1992 
Trial State Assessment to those of the 1992 national assessment and, hence, to those of the 1990 
Trial State and National Assessments. 



8.3.2 An Overview of Plausible Values Methodology 

Item response theory was developed in the context of measuring individual examinees’ 
abilities. In that setting, each individual is administered enough items (often 100 or more) to 
permit precise estimation of his or her $\theta$, as a maximum likelihood estimate $\hat{\theta}$, for example. 
Because the uncertainty associated with each $\theta$ is negligible, the distribution of $\theta$, or the joint 
distribution of $\theta$ with other variables, can then be approximated using individuals’ $\hat{\theta}$ values as if 
they were $\theta$ values. 




This approach breaks down in the assessment setting when, in order to provide broader 
content coverage in limited testing time, each respondent is administered relatively few items in 
a scaling area. The problem is that the uncertainty associated with individual $\theta$s is too large to 
ignore, and the features of the $\hat{\theta}$ distribution can be seriously biased as estimates of the $\theta$ 
distribution. (The failure of this approach was verified in early analyses of the 1984 NAEP 
reading survey; see Wingersky, Kaplan, & Beaton, 1987.) "Plausible values" were developed as a 
way to estimate key population features consistently, and approximate others no worse than 
standard IRT procedures would. A detailed development of plausible values methodology is 
given in Mislevy (1991). Along with theoretical justifications, that paper presents comparisons 
with standard procedures, discussions of biases that arise in some secondary analyses, and 
numerical examples. 

The following provides a brief overview of the plausible values approach, focusing on its 
implementation in the Trial State Assessment analyses. 

Let y represent the responses of all sampled examinees to background and attitude 
questions, along with design variables such as school membership, and let $\theta$ represent the 
subscale proficiency values. If $\theta$ were known for all sampled examinees, it would be possible to 
compute a statistic $t(\theta, y)$, such as a subscale or composite subpopulation sample mean, a 
sample percentile point, or a sample regression coefficient, to estimate a corresponding 
population quantity T. A function $U(\theta, y)$ (e.g., a jackknife estimate) would be used to gauge 
sampling uncertainty, as the variance of t around T in repeated samples from the population. 

Because the scaling models are latent variable models, however, $\theta$ values are not 
observed even for sampled students. To overcome this problem, we follow Rubin (1987) by 
considering $\theta$ as "missing data" and approximate $t(\theta, y)$ by its expectation given (x, y), the 
data that actually were observed, as follows: 

$$t^*(x, y) = E[t(\theta, y) \mid x, y] = \int t(\theta, y)\, p(\theta \mid x, y)\, d\theta \qquad (8.6)$$

It is possible to approximate $t^*$ using random draws from the conditional distribution of 
the scale proficiencies given the item responses $x_i$, background variables $y_i$, and model 
parameters for sampled student i. These values are referred to as "imputations" in the sampling 
literature, and "plausible values" in NAEP. The value of $\theta$ for any respondent that would enter 
into the computation of t is thus replaced by a randomly selected value from their conditional 
distribution. Rubin (1987) proposes that this process be carried out several times ("multiple 
imputations") so that the uncertainty associated with imputation can be quantified. The 
average of the results of, for example, M estimates of t, each computed from a different set of 
plausible values, is a Monte Carlo approximation of (8.6); the variance among them, B, reflects 
uncertainty due to not observing $\theta$, and must be added to the estimated expectation of $U(\theta, y)$, 
which reflects uncertainty due to testing only a sample of students from the population. Section 
8.4 explains how plausible values are used in subsequent analyses. 

It cannot be emphasized too strongly that plausible values are not test scores for 
individuals in the usual sense. Plausible values are offered only as intermediary computations 
for calculating integrals of the form of equation 8.6, in order to estimate population 
characteristics. When the underlying model is correctly specified, plausible values will provide 
consistent estimates of population characteristics, even though they are not generally unbiased 
estimates of the proficiencies of the individuals with whom they are associated. The key idea 
lies in a contrast between plausible values and the more familiar $\theta$ estimates of educational 
measurement that are in some sense optimal for each examinee (e.g., maximum likelihood 
estimates, which are consistent estimates of an examinee's $\theta$, and Bayes estimates, which provide 
minimum mean-squared errors with respect to a reference population): Point estimates that are 
optimal for individual examinees have distributions that can produce decidedly nonoptimal 
(specifically, inconsistent) estimates of population characteristics (Little & Rubin, 1983). Plausible 
values, on the other hand, are constructed explicitly to provide consistent estimates of 
population effects. 



8.3.3 Computing Plausible Values in IRT-based Scales 

Plausible values for each respondent i are drawn from the conditional distribution 
$p(\theta_i \mid x_i, y_i, \Gamma, \Sigma)$, where $\Gamma$ and $\Sigma$ are regression model parameters defined in this 
subsection. This subsection describes how, in IRT-based scales, these conditional distributions 
are characterized, and how the draws are taken. An application of Bayes’ theorem with the IRT 
assumption of conditional independence produces 

$$p(\theta_i \mid x_i, y_i, \Gamma, \Sigma) \propto P(x_i \mid \theta_i, y_i, \Gamma, \Sigma)\, p(\theta_i \mid y_i, \Gamma, \Sigma) = P(x_i \mid \theta_i)\, p(\theta_i \mid y_i, \Gamma, \Sigma) , \qquad (8.7)$$

where, for vector-valued $\theta_i$, $P(x_i \mid \theta_i)$ is the product over scales of the independent likelihoods 
induced by responses to items within each scale, and $p(\theta_i \mid y_i, \Gamma, \Sigma)$ is the multivariate, and 
generally nonindependent, joint density of proficiencies for the scales, conditional on the 
observed value $y_i$ of background responses, and the parameters $\Gamma$ and $\Sigma$. The scales are 
determined by the item parameter estimates that constrain the population mean to zero and 
standard deviation to one. The item parameter estimates are fixed and regarded as population 
values in the computation described in this subsection. 

In the analyses of the data from the Trial State Assessment and the data from the 
national mathematics assessment, a normal (Gaussian) form was assumed for 
$p(\theta_i \mid y_i, \Gamma, \Sigma)$, with a common variance, $\Sigma$, and with a mean given by a linear model with 
slope parameters, $\Gamma$, based on the first 98 to 154 principal components of 258 (grade 4) and 303 
(grade 8) selected main-effects and two-way interactions of the complete vector of background 
variables. The included principal components will be referred to as the conditioning variables, 
and will be denoted $y^c$. (The complete set of original background variables used in the Trial 
State Assessment analyses is listed in Appendix C.) The following model was fit to the data 
within each state: 

$$\theta = \Gamma' y^c + \epsilon , \qquad (8.8)$$

where $\epsilon$ is normally distributed with mean zero and variance $\Sigma$. The number of principal 
components of the conditioning variables used for each state was sufficient to account for 90 
percent of the total variance of the full set of conditioning variables (after standardizing each 
variable). As in regression analysis, $\Gamma$ is a matrix each of whose columns is the effects for one 
scale and $\Sigma$ is the matrix variance of residuals between subscales. By fitting the model (8.8) 
separately within each state, interactions between each state and the conditioning variables are 
automatically included in the conditional joint density of scale proficiencies. 
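
The rule for choosing the number of principal components (enough to account for 90 percent of the total variance of the standardized variables) can be illustrated with a short sketch. It assumes the eigenvalues of the correlation matrix have already been computed. The function name and the example eigenvalues are hypothetical.

```python
def n_components_for_share(eigenvalues, share=0.90):
    """Smallest number of leading principal components whose variance reaches the given share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= share * total:
            return count
    return len(ordered)


# Hypothetical eigenvalues, for illustration only.
print(n_components_for_share([5.0, 3.0, 1.5, 0.5]))
```

Because the count depends on each state's data, it can differ from state to state, which is consistent with the range of components reported above.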

Maximum likelihood estimates of $\Gamma$ and $\Sigma$, denoted by $\hat{\Gamma}$ and $\hat{\Sigma}$, are obtained from 
Sheehan’s (1985) MGROUP computer program using the EM algorithm described in Mislevy 
(1985). The EM algorithm requires the computation of the mean, $\bar{\theta}_i$, and variance, $\Sigma_i^p$, of the 
posterior distribution in (8.7). These moments are computed using higher order asymptotic 
corrections (Thomas, 1992). 

After completion of the EM algorithm, the plausible values are drawn in a three-step 
process from the joint distribution of the values of $\Gamma$ for all sampled respondents. First, a value 
of $\Gamma$ is drawn from a normal approximation to $P(\Gamma, \Sigma \mid x_i, y_i)$ that fixes $\Sigma$ at the value 
$\hat{\Sigma}$ (Thomas, 1992). Second, conditional on the generated value of $\Gamma$ (and the fixed value of 
$\Sigma = \hat{\Sigma}$), the mean, $\bar{\theta}_i$, and variance, $\Sigma_i^p$, of the posterior distribution in equation 8.7 
(i.e., $p(\theta_i \mid x_i, y_i, \Gamma, \Sigma)$) are computed using the same methods applied in the EM algorithm. 
In the third step, the $\theta_i$ are drawn independently from a multivariate normal distribution with 
mean $\bar{\theta}_i$ and variance $\Sigma_i^p$, approximating the distribution in (8.7). These three steps are 
repeated five times, producing five imputations of $\theta_i$ for each sampled respondent. 
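
The third step can be illustrated in a deliberately simplified form. The sketch below assumes a single scale, so the multivariate normal reduces to a univariate normal, and it takes the posterior mean and variance as given. It does not implement the first two steps (drawing the regression parameters and recomputing the posterior moments), and the function name and values are hypothetical.

```python
import random


def draw_plausible_values(posterior_mean, posterior_variance, n_draws=5, rng=None):
    """Draw n_draws values from a normal approximation to one respondent's posterior."""
    if posterior_variance < 0:
        raise ValueError("variance must be non-negative")
    rng = rng or random.Random()
    sd = posterior_variance ** 0.5
    return [rng.gauss(posterior_mean, sd) for _ in range(n_draws)]


# Hypothetical posterior moments for one respondent; a seeded generator keeps the run repeatable.
values = draw_plausible_values(0.3, 0.25, rng=random.Random(1992))
print(values)
```

In the procedure described above, the posterior moments would be recomputed for each of the five repetitions after a new value of the regression parameters is drawn, rather than held fixed as they are in this sketch.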



8.4 ANALYSES 

When survey variables are observed without error from every respondent, standard 
variance estimators quantify the uncertainty associated with sample statistics from the only 
source of the uncertainty, namely the sampling of respondents. Item percents correct for NAEP 
cognitive items meet this requirement, but scale-score proficiency values do not. The IRT 
models used in their construction posit an unobservable proficiency variable $\theta$ to summarize 
performance on the items in the subarea. The fact that $\theta$ values are not observed even for the 
respondents in the sample requires additional statistical analyses to draw inferences about $\theta$ 
distributions and to quantify the uncertainty associated with those inferences. As described 
above, Rubin’s (1987) multiple imputations procedures were adapted to the context of latent 
variable models to produce the plausible values upon which many analyses of the data from the 
Trial State Assessment were based. This section describes how plausible values were employed 
in subsequent analyses to yield inferences about population and subpopulation distributions of 
proficiencies. 



8.4.1 Computational Procedures 

Even though one does not observe the $\theta$ value of respondent i, one does observe 
variables that are related to it: $x_i$, the respondent’s answers to the cognitive items he or she was 
administered in the area of interest, and $y_i$, the respondent's answers to demographic and 
background variables. Suppose one wishes to draw inferences about a number $T(\theta, Y)$ that could 
be calculated explicitly if the $\theta$ and y values of each member of the population were known. 
Suppose further that if $\theta$ values were observable, we would be able to estimate T from a sample 
of N pairs of $\theta$ and y values by the statistic $t(\theta, y)$ [where 
$(\theta, y) \equiv (\theta_1, y_1, \ldots, \theta_N, y_N)$], and that we could estimate the variance in t around T due 
to sampling respondents by the function $U(\theta, y)$. Given that observations consist of $(x_i, y_i)$ 
rather than $(\theta_i, y_i)$, we can approximate t by its expected value conditional on (x, y), or 

$$t^*(x, y) = E[t(\theta, y) \mid x, y] = \int t(\theta, y)\, p(\theta \mid x, y)\, d\theta .$$

It is possible to approximate $t^*$ with random draws from the conditional distributions 
$p(\theta_i \mid x_i, y_i)$, which are obtained for all respondents by the method described in section 
8.3.3. Let $\hat{\theta}_m$ be the mth such vector of "plausible values," consisting of a multidimensional 
value for the latent variable of each respondent. This vector is a plausible representation of what 
the true $\theta$ vector might have been, had we been able to observe it. 

The following steps describe how an estimate of a scalar statistic $t(\theta, y)$ and its sampling 
variance can be obtained from M (> 1) such sets of plausible values. (Five sets of plausible 
values are used in NAEP analyses of the Trial State Assessment.) 

1) Using each set of plausible values $\hat{\theta}_m$ in turn, evaluate t as if the plausible values 
were true values of $\theta$. Denote the results $\hat{t}_m$, for m = 1, ..., M. 

2) Using the jackknife variance estimator defined in Chapter 7, compute the 
estimated sampling variance of $\hat{t}_m$, denoting the result $U_m$. 

3) The final estimate of t is 

$$t^* = \frac{1}{M} \sum_{m=1}^{M} \hat{t}_m .$$

4) Compute the average sampling variance over the M sets of plausible values, to 
approximate uncertainty due to sampling respondents: 

$$U^* = \frac{1}{M} \sum_{m=1}^{M} U_m .$$

5) Compute the variance among the M estimates $\hat{t}_m$, to approximate uncertainty due 
to not observing $\theta$ values from respondents: 

$$B_M = \frac{1}{M - 1} \sum_{m=1}^{M} (\hat{t}_m - t^*)^2 .$$

6) The final estimate of the variance of $t^*$ is the sum of two components: 

$$V = U^* + (1 + M^{-1}) B_M .$$


Note: Due to the excessive computation that would be required, NAEP analyses did not 
compute and average jackknife variances over all five sets of plausible values, but only on 
the first set. Thus, in NAEP reports, if is approximated by Uj. 
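
The combining steps above can be illustrated with a short Python sketch. It is not the production NAEP code: the function name and argument layout are assumptions made for illustration, and the per-set sampling variances are simply passed in (in practice they would come from the jackknife estimator of Chapter 7, or only the first would be computed and reused, as the note above explains).

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-set results into (t_star, u_star, b_m, v).

    estimates          -- t_hat_m, one estimate per set of plausible values
    sampling_variances -- U_m, one sampling variance per set
    (Hypothetical helper; names are illustrative only.)
    """
    m = len(estimates)
    if m < 2 or len(sampling_variances) != m:
        raise ValueError("need M > 1 estimates and one variance per estimate")
    t_star = mean(estimates)            # step 3: average of the M estimates
    u_star = mean(sampling_variances)   # step 4: average sampling variance
    b_m = variance(estimates)           # step 5: between-set variance, divisor M - 1
    v = u_star + (1 + 1 / m) * b_m      # step 6: total variance
    return t_star, u_star, b_m, v
```

With five estimates 250, 252, 251, 249, and 253 and a common sampling variance of 1.0, the sketch gives an estimate of 251, a between-set variance of 2.5, and a total variance of 1.0 + 1.2 x 2.5 = 4.0.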



8.4.2 Statistical Tests 

Suppose that if $\theta$ values were observed for sampled students, the statistic $(t - T)/U^{1/2}$ 
would follow a $t$-distribution with $d$ degrees of freedom. Then the incomplete-data statistic 
$(t^* - T)/V^{1/2}$ is approximately $t$-distributed, with degrees of freedom given by 

$$\nu = \frac{1}{\dfrac{f_M^2}{M - 1} + \dfrac{(1 - f_M)^2}{d}}$$

where $f_M$ is the proportion of total variance due to not observing $\theta$ values: 

$$f_M = (1 + M^{-1}) B_M / V .$$

When $B_M$ is small relative to $U^*$, the reference distribution for incomplete-data statistics 
differs little from the reference distribution for the corresponding complete-data statistics. This 
is the case with main NAEP reporting variables. If, in addition, $d$ is large, the normal 
approximation can be used to flag "significant" results. 

For $k$-dimensional $t$, such as the $k$ coefficients in a multiple regression analysis, each $U_m$ 
and $U^*$ is a covariance matrix, and $B_M$ is an average of squares and cross-products rather than 
simply an average of squares. In this case, the quantity $(T - t^*)\, V^{-1} (T - t^*)'$ is approximately $F$ 
distributed, with degrees of freedom equal to $k$ and $\nu$, with $\nu$ defined as above but with a matrix 
generalization of $f_M$: 

$$f_M = (1 + M^{-1})\, \mathrm{Trace}(B_M V^{-1}) / k .$$

By the same reasoning as used for the normal approximation for scalar $t$, a chi-square 
distribution on $k$ degrees of freedom often suffices. 
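
For the scalar case, the degrees-of-freedom calculation can be sketched in Python as follows. The sketch assumes the formulas read f_M = (1 + 1/M) B_M / V and nu = 1 / (f_M^2 / (M - 1) + (1 - f_M)^2 / d); the function names are hypothetical and are not taken from any NAEP software.

```python
def fraction_missing(b_m, v, m):
    """Proportion of total variance due to not observing theta (f_M)."""
    return (1 + 1 / m) * b_m / v


def degrees_of_freedom(b_m, v, m, d):
    """Approximate degrees of freedom (nu) for the incomplete-data statistic.

    b_m -- between-set variance, v -- total variance,
    m   -- number of plausible-value sets, d -- complete-data degrees of freedom
    """
    f = fraction_missing(b_m, v, m)
    return 1.0 / (f ** 2 / (m - 1) + (1 - f) ** 2 / d)
```

When the between-set variance is zero, f_M is zero and the function returns d, which matches the remark that the incomplete-data reference distribution then differs little from the complete-data one.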



8.4.3 Biases in Secondary Analyses 

Statistics $t^*$ that involve proficiencies in a scaled content area and variables included in 
the conditioning variables $y^c$ are consistent estimates of the corresponding population values $T$. 
Statistics involving background variables y that were not conditioned on, or relationships among 
proficiencies from different content areas, are subject to asymptotic biases whose magnitudes 
depend on the type of statistic and the strength of the relationships of the nonconditioned 
background variables to the variables that were conditioned on and to the proficiency of interest. 
That is, the large sample expectations of certain sample statistics need not equal the true 
population parameters. 




The direction of the bias is typically to underestimate the effect of nonconditioned 
variables. For details and derivations see Beaton and Johnson (1990), Mislevy (1991), and 
Mislevy and Sheehan (1987, section 10.3.5). For a given statistic $t^*$ involving one content area 
and one or more nonconditioned background variables, the magnitude of the bias is related to 
the extent to which observed responses $x$ account for the latent variable $\theta$, and the degree to 
which the nonconditioned background variables are explained by conditioning background 
variables. The first factor, conceptually related to test reliability, acts consistently in that 
greater measurement precision reduces biases in all secondary analyses. The second factor acts 
to reduce biases in certain analyses but increase them in others. In particular, 

• High shared variance between conditioned and nonconditioned background 
variables mitigates biases in analyses that involve only proficiency and 
nonconditioned variables, such as marginal means or regressions. 

• High shared variance exacerbates biases in regression coefficients of conditional 
effects for nonconditioned variables, when nonconditioned and conditioned 
background variables are analyzed jointly as in multiple regression. 

The large number of background variables that have been included in the conditioning 
vector for the Trial State Assessment allows a large number of secondary analyses to be carried 
out with little or no bias, and mitigates biases in analyses of the marginal distributions of $\theta$ in 
nonconditioned variables. Kaplan and Nelson's analysis of the 1988 NAEP reading data (some 
results of which are summarized in Mislevy, 1991), which had a similar design and fewer 
conditioning variables, indicates that the potential bias for nonconditioned variables in multiple 
regression analyses is below 10 percent, and that biases in simple regressions of such variables are 
below 5 percent. Additional research (summarized in Mislevy, 1990) indicates that most of the 
bias reduction obtainable from conditioning on a large number of variables can be captured by 
instead conditioning on the first several principal components of the matrix of all original 
conditioning variables. This procedure was adopted for the Trial State Assessment by replacing 
the conditioning effects by the first K principal components, where K was selected so that 90 
percent of the total variance of the full set of conditioning variables (after standardization) was 
captured. Mislevy (1991) shows that this puts an upper bound of 10 percent on the average bias 
for all analyses involving the original conditioning variables. 
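
The rule for choosing K can be sketched as follows. This is an illustration only: it assumes the eigenvalues of the correlation matrix of the standardized conditioning variables have already been computed, and the function name and the 0.90 default are labels chosen here, not names from the NAEP software.

```python
def components_needed(eigenvalues, target=0.90):
    """Return the smallest K whose leading components reach the target
    share of total variance (hypothetical helper for illustration)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)
```

For example, eigenvalues of 5.0, 3.0, 1.5, 0.3, and 0.2 give cumulative shares of 50, 80, and 95 percent for the first three components, so K = 3.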



8.5 SCALE ANCHORING AND ACHIEVEMENT LEVELS 

Since its beginning, a goal of NAEP has been to inform the public about what students in 
American schools know and can do. While the NAEP scales provide information about the 
distributions of proficiency for the various subpopulations, they do not directly provide 
information about the meaning of various points on the scale. Traditionally, meaning has been 
attached to educational scales by norm-referencing — that is, by comparing students at a 
particular scale level to other students. In contrast, NAEP scale anchors and achievement levels 
describe selected points on the scale in terms of the types of skills that are or should be 
exhibited by students scoring at that level. Both the scale anchoring and the achievement level 
processes were applied to the 1992 national NAEP mathematics composite. However, since the 
Trial State Assessment scales were linked to the national scales, the interpretations of the 
selected levels also apply to the Trial State Assessment. 




As applied to the 1992 mathematics data, scale anchoring began by identifying four 
anchoring levels on the mathematics composite: 200, 250, 300, 350. The next step was to 
identify items that a large majority (at least 65 percent) of students at a given anchor level could 
answer correctly but that most students (at least 50 percent) at the next lower level answered 
incorrectly. Additionally, there had to be at least a 30 percentage point difference in the 
probabilities of success between the two levels. The result was a grouping of assessment items 
by the levels between which they discriminate. These anchor items were then reviewed by 
subject area experts who, using their knowledge of mathematics and student performance, 
generalized from the items to descriptions of the types of skills exhibited at each level. Further 
details of the anchoring process appear in Appendix F. 
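
The three anchoring criteria stated above can be written as a simple check. The sketch below is only an illustration of those stated thresholds; the function name is hypothetical, the inputs are assumed to be proportions of students answering correctly at an anchor level and at the next lower level, and the treatment of the lowest anchor level (which has no lower level) is not covered.

```python
def anchors_at_level(p_at_level, p_at_lower_level):
    """True if an item meets the three stated anchoring criteria.

    p_at_level       -- proportion correct among students at the anchor level
    p_at_lower_level -- proportion correct at the next lower anchor level
    """
    return (p_at_level >= 0.65                      # large majority succeed at the level
            and p_at_lower_level <= 0.50            # at least half fail at the lower level
            and p_at_level - p_at_lower_level >= 0.30)  # 30-point gap between levels
```

An item answered correctly by 80 percent of students at level 300 and 45 percent at level 250 would anchor at 300; an item at 70 percent and 45 percent would not, because the gap is only 25 points.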

The National Assessment Governing Board has determined that achievement levels shall 
be the first and primary way of reporting NAEP results. Setting achievement levels is a method 
for setting standards on the NAEP assessment that identifies what students should know and be 
able to do at various points on the mathematics composite. For each grade, three levels were 
defined— basic, proficient, and advanced. Based on initial policy definitions of these levels, 
panelists were asked to determine operational descriptions of the levels appropriate with the 
content and skills assessed in the mathematics assessment. With these descriptions in mind, the 
panelists were then asked to rate the assessment items in terms of the expected performance of 
marginally acceptable examinees at each of these three levels. These ratings were then mapped 
onto the NAEP scale and adjusted downward one standard error of the mean panelist rating to 
obtain the achievement level cutpoints for reporting. Further details of the achievement level 
setting process appear in Appendix G. 




Chapter 9 



DATA ANALYSIS AND SCALING FOR 
THE 1992 TRIAL STATE ASSESSMENT IN MATHEMATICS 



John Mazzeo, Huahua Chang, Edward Kulick, Y. Fai Fong, and Angela Grima 

Educational Testing Service 



9.1 OVERVIEW 

This chapter describes the analyses carried out in the development of the 1992 Trial 
State Assessment mathematics scales. The procedures used were similar to those employed in 
the analysis of the 1990 Trial State Assessment (Mazzeo, 1991) and are based on the 
philosophical and theoretical underpinnings described in the previous chapter. However, the 
1990 methods needed to be extended in a number of ways to accommodate the evolving nature 
of NAEP in general and the Trial State Assessment in particular. The changes incorporated 
into the 1992 Trial State Assessment included the assessment of both fourth-grade and 
eighth-grade samples for each jurisdiction, the addition of items measuring estimation skills, and the 
introduction of extended constructed-response items. 

There were five major steps in the analysis of the Trial State Assessment mathematics 
data, each of which is described in a separate section: 

• conventional item and test analyses (section 9.3); 

• item response theory (IRT) scaling (section 9.4); 

• estimation of state and subgroup proficiency distributions based on the "plausible 
values" methodology (section 9.5); 

• linking of the 1992 Trial State Assessment scales to the corresponding scales from 
the 1992 national assessment (section 9.6); and 

• creation of the Trial State Assessment mathematics composite scale (section 9.7). 

To set the context within which to describe the methods and results of scaling 
procedures, a brief review of the assessment instruments and administration procedures is 
provided. 
9.2 DESCRIPTION OF ITEMS, ASSESSMENT BOOKLETS, AND ADMINISTRATION PROCEDURES 

The general design structure of the 1992 Trial State Assessment was the same as that 
used in 1990. However, the particulars of the 1992 design differed in several respects from 
those of 1990. First, the 1990 assessment was administered to eighth-grade students only, while 
the 1992 assessment included samples of both fourth- and eighth-grade public-school students. 
Second, the 1992 assessment used a somewhat different set of instruments from those used in 
1990. The 1992 item pool was based on the same curriculum framework used for 1990 national 
and Trial State Assessments and contained four blocks of items at each grade level that were 
identical to blocks administered in 1990. However, the 1992 item pool included an expanded 
number of blocks containing new material including a greater proportion of the conventional 
short constructed-response items and 11 newly developed extended constructed-response items. 
Each extended constructed-response item required about five minutes to complete and was 
scored on a 0-to-4 scale. All extended constructed-response items appeared as the last item in 
their respective blocks. The 1992 item pool also included a block of items measuring estimation 
skills. This estimation block had been included in the 1990 national assessment but not in the 
1990 Trial State Assessment. 

The fourth-grade item pool contained 175 items. Of these, 155 were categorized into 
one of the five content areas: 63 items for Numbers and Operations, 29 items for 
Measurement, 27 for Geometry, 20 for Data Analysis, Statistics, and Probability, and 16 for 
Algebra and Functions. These items, consisting of 95 multiple-choice items, 53 short 
constructed-response items, 5 extended constructed-response items, and 2 "testlets,"¹ were 
divided into 13 mutually exclusive blocks. The composition of each block of items, in terms of 
content and format, is given in Table 9-1.² An additional 20 multiple-choice items measuring 
estimation abilities were assembled into a separate block of items. 

The eighth-grade item pool contained 205 items. One hundred and eighty-three items, 
55 of which were common to the fourth grade, were classified into the five content areas as 
follows: 58 items for Numbers and Operations, 32 items for Measurement, 36 for Geometry, 28 
for Data Analysis, Statistics, and Probability, and 29 for Algebra and Functions. These 183 
items, consisting of 116 multiple-choice items, 59 short constructed-response items, 6 extended 
constructed-response items, and 2 testlets, were divided into 13 mutually exclusive blocks. The 
composition of each block of items, in terms of content and format, is given in Table 9-2. An 
additional 22 multiple-choice items measuring estimation abilities were assembled into a 
separate block of items. Ten of these items were also used at the fourth grade. 

Twelve of the 14 fourth-grade blocks contained one or more constructed-response items; 
one block consisted entirely of constructed-response items. Twelve of the 14 eighth-grade blocks 



¹A testlet is a group of items (in the case of NAEP, typically three or four items) that are related to a single content area, 
topic, or stimulus and are developed and scored as a single unit (see Wainer & Kiely, 1987, for further details and examples 
of different types of testlets). 

²The numbers in Tables 9-1 and 9-2 differ slightly from those given in Chapter 2. The numbers in Chapter 2 do not 
reflect the grouping of certain sets of items into testlets for the purposes of scaling. 
Table 9-1 

1992 NAEP Mathematics Block Composition by Scale and Item Type 

Grade 4 
Table 9-2 

1992 NAEP Mathematics Block Composition by Scale and Item Type 

Grade 8 

also contained one or more constructed-response items; two blocks consisted entirely of 
constructed-response items. The questions contained in the constructed-response block at fourth 
grade and one of the two constructed-response blocks at eighth grade required the manipulation 
of geometric shapes for their solution. Students assigned these blocks were provided a packet 
containing the necessary shapes during the time period in which they worked on these items. 
These and all other constructed-response items were scored by specially trained readers, as 
described in Chapter 5. 

At grade 4, 37 items required the use of a calculator for their solutions. These items 
appeared in three of the blocks (15 items in block M8, 12 in M12, and 10 in M14). At grade 8, 
37 calculator items appeared in three blocks (18 items in block M8, 9 in M12, and 10 in M14). 
Each student assigned a block containing calculator items was given a Texas Instruments 
calculator (a TI-108 four-function calculator at grade 4, a TI-30 scientific calculator at grade 8) 
to use while he or she was working on that block. For each calculator item, both fourth- and 
eighth-grade students were asked to indicate whether they had in fact used the calculator to 
answer each item in the blocks for which calculators were made available. 

One item at grade 4 required the use of a ruler; five items at grade 8 required the use of 
a protractor/ruler. Students administered the(se) item(s) were provided with the necessary 
tools for the 15-minute period they worked on the block containing the item(s). 

There were a total of 27 assessment booklets for each grade. The block of estimation 
items was assembled into a single booklet. The 13 non-estimation blocks were used to form 26 
different booklets according to a balanced incomplete block (BIB) design (see Chapter 2 for 
details). Each of these booklets contained three blocks of items, and each block of items 
appeared in exactly six booklets. To balance possible block position effect, each block appeared 
twice as the first block of mathematics items, twice as the second block, and twice as the third 
block. In addition, the BIB design required that each block of items be paired in a booklet with 
every other block of items exactly once. 
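
The balance properties just described can be checked on a small constructed example. The sketch below does not reproduce the actual NAEP booklet map (see Chapter 2 for that); it builds one design with the same properties by cyclically shifting two base triples, (0, 1, 4) and (0, 2, 7), modulo 13. Blocks are labeled 0 to 12 here rather than by their NAEP block names, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 13
# Illustrative base triples; their pairwise differences cover 1..12 exactly once.
BASE_TRIPLES = [(0, 1, 4), (0, 2, 7)]


def build_booklets():
    """Return 26 ordered booklets, each a tuple of three block labels."""
    return [tuple((b + shift) % N_BLOCKS for b in base)
            for base in BASE_TRIPLES
            for shift in range(N_BLOCKS)]


def check_balance(booklets):
    """Verify the three balance properties described in the text."""
    appearances = Counter(b for booklet in booklets for b in booklet)
    positions = Counter((pos, b) for booklet in booklets
                        for pos, b in enumerate(booklet))
    pairs = Counter(frozenset(pair) for booklet in booklets
                    for pair in combinations(booklet, 2))
    return (all(appearances[b] == 6 for b in range(N_BLOCKS))
            and all(positions[(pos, b)] == 2
                    for pos in range(3) for b in range(N_BLOCKS))
            and len(pairs) == N_BLOCKS * (N_BLOCKS - 1) // 2
            and all(count == 1 for count in pairs.values()))
```

Calling `check_balance(build_booklets())` confirms that each block appears in six booklets, twice in each position, and is paired with every other block exactly once, which accounts for all 78 possible pairs of the 13 blocks.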

The design of the 1992 state mathematics assessment required that each student be 
administered two booklets— one of the 26 booklets in the BIB design, followed by the 
estimation booklet. Within each administration site, all booklets except the estimation booklet 
were "spiraled" together in a random sequence and distributed to students sequentially, in the 
order of the students' names on the Student Listing Form (see Chapter 4). As a result of the 
BIB design and the spiraling of booklets, a considerable degree of balance was achieved in the 
data collection process. With the exception of the estimation block, each block of items (and, 
therefore, each item) was administered to randomly equivalent samples of students of 
approximately equal size (i.e., about 6/26 of the total sample size) within each jurisdiction and 
across all jurisdictions. In addition, within and across jurisdictions, randomly equivalent samples 
of approximately equal size received each particular block of items as the first, second, or third 
block within a booklet. The full sample of students in each jurisdiction was administered the 
estimation booklet and all students attempted these items after completing one of the BIB 
booklets. 

As described in Chapter 4, a randomly selected half of the administration sessions within 
each state were observed by Westat-trained quality control monitors. Thus, within and across 
states, randomly equivalent samples of students received each block of items in a particular 
position within a booklet under monitored and unmonitored administration conditions. 
Randomly equivalent samples of students within and across states also were administered the 
estimation items under monitored and unmonitored administration conditions. 



9.3 ITEM ANALYSES 

9.3.1 Conventional Item and Test Analyses 

Tables 9-3 and 9-4 contain summary statistics for each block of items for the fourth and 
the eighth grades respectively. Block-level statistics are provided both overall and, for all but 
the estimation block, by serial position of the block within booklet. To produce these tables, 
data from all 44 jurisdictions were aggregated and statistics were calculated using rescaled 
versions of the final sampling weights provided by Westat. The rescaling, carried out within 
each jurisdiction, constrained the sum of the sampling weights within that jurisdiction to be 
equal to its sample size. Use of the rescaled weights does nothing to alter the value of statistics 
calculated separately within each jurisdiction. However, for statistics obtained from samples that 
combine students from different jurisdictions, use of the rescaled weights results in a roughly 
equal contribution of each jurisdiction's data to the final value of the estimate. As discussed in 
Mazzeo (1991), equal contribution of each jurisdiction's data to the results of the IRT scaling 
was viewed as a desirable outcome and, as described in the scaling section below, these same 
rescaled weights were used in carrying out that scaling. Hence, the item analysis statistics shown 
in Tables 9-3 and 9-4 are consistent with the weighting used in scaling. 
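
The rescaling described above amounts to multiplying each weight by the jurisdiction's sample size divided by the jurisdiction's weight total. A minimal Python sketch follows; the record layout (a list of jurisdiction and weight pairs) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def rescale_weights(records):
    """Rescale weights so they sum to the sample size within each jurisdiction.

    records -- list of (jurisdiction, weight) pairs, one per student
    Returns the rescaled weights in the same order as the input.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for jurisdiction, weight in records:
        totals[jurisdiction] += weight
        counts[jurisdiction] += 1
    return [weight * counts[jurisdiction] / totals[jurisdiction]
            for jurisdiction, weight in records]
```

Because every weight within a jurisdiction is multiplied by the same constant, weighted statistics computed within that jurisdiction are unchanged, while in a pooled analysis each jurisdiction contributes in proportion to its sample size rather than its population size.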

Tables 9-3 and 9-4 show the number of students assigned each block of items, the 
average item score, the average biserial correlation, and the proportion of students attempting 
the last item in that block. The average item score for the block is the average, over items, of 
the score means for each of the individual items in the block. For binary-scored multiple-choice 
and constructed-response items, these score means correspond to the proportion of students 
who correctly answered each item. For the testlets and extended constructed-response items, 
the score means were calculated as item score mean divided by the maximum number of points 
possible. 

In NAEP analyses (both conventional and IRT-based), a distinction is made between 
missing responses at the end of each block (i.e., missing responses subsequent to the last item 
the student answered) and missing responses prior to the last observed response. Missing 
responses before the last observed response are considered intentional omissions. In calculating 
the average score for each item, only students classified as having been presented the item were 
included in the denominator of the statistic. Intentional omissions were treated as incorrect 
responses. Missing responses at the end of the block are considered "not-reached," and treated 
as if they had not been presented to the student. The proportion of students attempting the last 
item of a block (or, equivalently, 1 minus the proportion of students not reaching the last item) is 
often used as an index of the degree of speededness associated with the administration of that 
block of items. 

Standard practice at ETS is to treat all nonrespondents to the last item as if they had not 
reached the item. For multiple-choice and standard constructed-response items, the use of such 
a convention most often produces a reasonable pattern of results in that the proportion reaching 
Table 9-3 

Descriptive Statistics for Each Block of Items by Position Within Test Booklet and Overall 

Grade 4 

Table 9-4 

Descriptive Statistics for Each Block of Items by Position Within Test Booklet and Overall 

Grade 8 

Estimation block: administered using paced-tape procedures to all students as the fourth and final block 





the last item is not dramatically smaller than the proportion reaching the next-to-last item. 
However, for the blocks that ended with extended constructed-response items, use of the 
standard ETS convention resulted in an extremely large drop in the proportion of students 
attempting the final item. A drop of such magnitude seemed somewhat implausible. Therefore, 
for blocks ending with an extended constructed-response item, students who answered the next- 
to-last item but did not respond to the extended constructed-response item were classified as 
having intentionally omitted the last item. 
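
The classification rules for missing responses, including the exception for blocks that end with an extended constructed-response item, can be summarized in a short sketch. The representation of a response vector (a list with `None` for a missing response) and the label strings are assumptions made for illustration.

```python
def classify_responses(responses, ends_with_extended_cr=False):
    """Label each item in a block as answered, omitted, or not_reached.

    responses -- list in block order; None marks a missing response
    ends_with_extended_cr -- True if the last item is an extended
        constructed-response item (the exception described in the text)
    """
    answered = [i for i, r in enumerate(responses) if r is not None]
    last_answered = answered[-1] if answered else -1
    labels = []
    for i, r in enumerate(responses):
        if r is not None:
            labels.append("answered")
        elif i < last_answered:
            labels.append("omitted")        # missing before the last observed response
        else:
            labels.append("not_reached")    # missing after the last observed response
    # Exception: next-to-last item answered, final extended item left blank.
    if (ends_with_extended_cr and len(responses) >= 2
            and responses[-1] is None and responses[-2] is not None):
        labels[-1] = "omitted"
    return labels
```

Under these rules an "omitted" item counts as incorrect in the item's average score, while a "not_reached" item is excluded from the denominator.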

The average biserial correlation is the average, over items, of the item-level biserial 
correlations (r-biserial). For each item-level r-biserial, total block number-correct score 
(including the item in question, and with students receiving zero points for all not-reached 
items) was used as the criterion variable for the correlation. Data from students classified as 
not reaching the item were omitted from the calculation of the statistic. 
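
For a binary-scored item, an item-level r-biserial of this kind can be sketched with the standard textbook formula, r = ((M1 - M0) / s) x (pq / y), where M1 and M0 are the mean criterion scores of students answering correctly and incorrectly, s is the standard deviation of the criterion, p is the proportion correct, q = 1 - p, and y is the standard normal ordinate at the point dividing the distribution into proportions p and q. The report does not state the exact formula used, so this is an assumption; the sketch is also unweighted, whereas the reported statistics used rescaled sampling weights, and it assumes that not-reached students have already been removed.

```python
from statistics import NormalDist, mean, pstdev


def biserial(item_scores, block_totals):
    """Unweighted biserial correlation of a 0/1 item with a criterion score.

    item_scores  -- 0 or 1 per student (students not reaching the item removed)
    block_totals -- total block number-correct score per student
    """
    right = [t for s, t in zip(item_scores, block_totals) if s == 1]
    wrong = [t for s, t in zip(item_scores, block_totals) if s == 0]
    if not right or not wrong:
        raise ValueError("item needs both correct and incorrect responses")
    sd = pstdev(block_totals)
    if sd == 0:
        raise ValueError("criterion scores have no variance")
    p = len(right) / len(item_scores)
    q = 1 - p
    normal = NormalDist()
    ordinate = normal.pdf(normal.inv_cdf(p))   # normal density at the p/q cut point
    return (mean(right) - mean(wrong)) / sd * p * q / ordinate
```

Unlike the point-biserial, this coefficient is not bounded by 1 in small or non-normal samples, which is worth keeping in mind when reading block averages.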

As is evident from Tables 9-3 and 9-4, the difficulty and the internal consistency of the 
blocks varied somewhat. Such variability was expected since these blocks were not created to be 
parallel in either difficulty or content. Based on the proportion of students attempting the last 
item, 2 blocks for the fourth grade and 3 blocks for the eighth grade seem to be somewhat 
speeded. Only 77 percent of the fourth-grade students taking block M5 (which required the use 
of a ruler) and 71 percent taking block M12 (which required a calculator) reached the last item 
in the block. Only 62 percent of the eighth-grade students taking block M7 (a calculator block), 
59 percent taking block M8, and 62 percent taking block M9 reached the last items of these 
blocks. 



These two tables also indicate that there was little variability in average item scores or 
average biserial correlations for each block by serial position within the assessment booklet. 
This suggests that serial position within booklet had a negligible effect on the overall difficulty of 
the block. However, for the fourth grade, one aspect of block-level performance that did differ 
by serial position was the proportion of students attempting the last item in the block. As 
shown in Table 9-3, for blocks M5, M6, M7, M8, M10, M11, M12, and M14, the percentage of 
the students attempting the last item increased as the serial position of the block increased. 
Perhaps fourth-grade students are able to work more quickly in later blocks or are better able to 
pace themselves as a result of their experience with the first block of items that they attempt. It 
is interesting to note that this effect was particularly salient for blocks M5, M8, and M12. 
Blocks M8 and M12 required the use of a calculator and block M5 required the use of a ruler. 
Only one block at grade 8 showed a substantial position effect. Interestingly, this was again 
block M5, which required the use of a protractor/ruler. 

As mentioned earlier, in an attempt to maintain rigorous standardized administration 
procedures across the states, a randomly selected 50 percent of all sessions within each state was 
observed by a Westat-trained quality control monitor. Observations from this random half of 
the sessions provided information about the quality of administration procedures and the 
frequency of departures from standardized procedures in the monitored sessions (see Chapter 4, 
section 4.3.6 for a discussion of the results of these observations). In addition, unexpectedly 
large differences in results from monitored and unmonitored sessions (i.e., differences larger 
than those to be expected due to sampling fluctuation) provided a means to identify instances of 
cheating, breaches of test security, or other breaks in standardization occurring in the 
unmonitored sessions that might threaten the validity of assessment results. 
When results were aggregated over all participating jurisdictions, there was little 
difference between the performance of students who attended monitored or unmonitored 
sessions. The average item score (over all 14 blocks and over all 44 participating jurisdictions) 
for the fourth-grade students was .46 for both monitored and unmonitored sessions. The 
average item score for the eighth grade for both monitored and unmonitored sessions was .51. 
Tables 9-5 and 9-6 provide, for each block of items, the average item score, average 
r-biserial, and the proportion of students attempting the last item for students whose sessions 
were monitored and students whose sessions were not monitored. Little or no differences in 
average item performance by session type were evident. These aggregate results are quite 
consistent with those observed in the 1990 Trial State Assessment, where no evidence was found 
that students who attended monitored sessions performed differently than those who attended 
unmonitored sessions. 

Figure 9-1 presents stem-and-leaf displays for grades 4 and 8 of the differences between 
monitored and unmonitored average item scores (over all 14 blocks) for each of the 44 
jurisdictions participating in the 1992 Trial State Assessment. Stem-and-leaf displays, developed 
by Tukey (1977), are somewhat like histograms. For this figure (and all other stem-and-leaf 
displays that follow), the first column contains observation depths (Hoaglin, Mosteller, & Tukey, 
1983). Depths are essentially cumulative frequencies, counted up from the lowest value for 
score intervals ("stems") below the median and counted down from the highest value for score 
intervals above the median. The second column contains a count of the number of "leaves" on 
each stem. In histogram terms, these counts would be considered frequencies. The remainder 
of the figure contains the stem-and-leaf display. The combination of a stem with each of its 
leaves gives the actual value of one observation (i.e., the difference in average item scores for 
monitored and unmonitored sessions in a participating jurisdiction). 

At the fourth grade, the median difference (monitored minus unmonitored) was .0015. 
For 21 jurisdictions, the difference was negative (i.e., students from unmonitored sessions scored 
higher than students from monitored sessions), with the largest difference being -.021. For the 
remaining 23 jurisdictions, the difference was positive, with the largest difference being .028. In 
evaluating the magnitude of these differences, it should be noted that the standard error for a 
difference in proportions from independent simple random samples of size 1,250 (half the 
typical total state sample size of 2,500) from a population with a true proportion of .5 is about 
.02. For samples with complex sampling designs like NAEP, the standard errors tend to be 
larger than those associated with simple random sampling. A reasonable estimate of the design 
effect for average item scores based on past NAEP experience with item proportion correct 
statistics is about 1.5 (Johnson & Rust, 1992), which suggests that a typical estimate of the 
standard error of the difference between monitored and unmonitored sessions would be about 
.024. For 41 of the 44 participants, the absolute differences in item score means at fourth grade 
were less than .02, and all but one were less than .024. In summary, differences in results 
obtained from the two types of sessions at the fourth grade were well within the bounds 
expected due to sampling fluctuation. 
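
The two benchmark values quoted above (.02 and .024) can be reproduced with a short calculation. The sketch below assumes the usual variance formula for a difference between two independent proportions and treats the design effect as a multiplier on that variance; the function name `se_difference` is hypothetical.

```python
import math


def se_difference(p, n_per_group, design_effect=1.0):
    """Standard error of a difference between two independent proportions.

    Each group has n_per_group observations and true proportion p. The
    design effect inflates the simple-random-sampling variance.
    """
    srs_variance = 2 * p * (1 - p) / n_per_group
    return math.sqrt(design_effect * srs_variance)


se_srs = se_difference(0.5, 1250)                       # simple random sampling
se_complex = se_difference(0.5, 1250, design_effect=1.5)
print(round(se_srs, 3), round(se_complex, 3))
```

Under these assumptions the first value is .02 and the second rounds to .024, matching the figures used in the text.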

At the eighth grade, the median difference was essentially zero (.0005). However, the 
distribution of differences was somewhat negatively skewed. For 19 jurisdictions, the differences 
were negative. Two were larger in absolute magnitude than .024, both of which were negative 
(-.037 for the Virgin Islands and -.033 for Florida). For the remaining 25 jurisdictions, the 
difference was zero or positive, and all of them were less than or equal to .015 in magnitude. 

Figure 9-1

Stem-and-leaf Display* of State-by-state Differences in
Average Item Scores (Monitored - Unmonitored)

N = 44, Median = 0.0015, Quartiles = -0.0045, 0.0100

* The first column of numbers shows observation depths; the second column shows the number of observations;
the remainder of the figure contains the stem-and-leaf display.

Although the presence of somewhat larger differences at grade 8 is worth noting, even these 
larger differences are probably less than 2 standard errors in magnitude. Thus, in sum, 
differences in results obtained from the two types of sessions at the eighth grade were also 
within the bounds expected due to sampling fluctuation. 



9.3.2 Differential Item Functioning (DIF) Analyses 

Prior to scaling, differential item functioning (DIF) analyses were carried out on 1992 
NAEP mathematics data from the national cross-sectional samples at grades 4, 8, and 12 and 
the Trial State Assessment samples at grades 4 and 8. The purpose of these analyses was to 
identify items that were differentially difficult for various subgroups and to reexamine such items 
with respect to their fairness and their appropriateness for inclusion in the scaling process. The 
information in this section focuses mainly on the analyses conducted on the Trial State 
Assessment data. A description of the results based on the national assessment will appear in 
the forthcoming technical report for that assessment. 

The DIF analyses were based on the Mantel-Haenszel chi-square procedure, as adapted 
by Holland and Thayer (1988). The procedure tests the statistical hypothesis that the odds of 
correctly answering an item are the same for two groups of examinees that have been matched 
on some measure of proficiency (usually referred to as the matching criterion). The groups 
being compared are often referred to as the focal group (usually a minority group of interest, 
such as Black examinees or female examinees) and the reference group (usually White 
examinees or male examinees). The measure of proficiency used is typically the number-correct 
score on some collection of items. Separate analyses were performed for each block of items 
(i.e., data were pooled across booklets containing the block being analyzed), and number correct 
score on the block of items in question was used as the measure of proficiency. 

For each item in the assessment, an estimate was produced of the Mantel-Haenszel 
common odds-ratio, expressed on the ETS delta scale for item difficulty. The estimates 
indicate the difference between reference group and focal group item difficulties (measured in 
ETS delta scale units), and typically run between about +3 and -3. Positive values indicate 
items that are differentially easier for the focal group than the reference group after making an 
adjustment for the overall level of proficiency in the two groups. Similarly, negative values 
indicate items that are differentially harder for the focal group than the reference group. It is 
common practice at ETS to categorize each item into one of three categories (Petersen, 1988): 
"A" (items exhibiting no DIF), "B" (items exhibiting a weak indication of DIF), or "C" (items 
exhibiting a strong indication of DIF). Items in category A have Mantel-Haenszel values that do 
not differ significantly from 0 at the alpha = .05 level. Two conditions must be met in order for 
items to fall in category B. The Mantel-Haenszel value for the item must (1) be significantly 
greater than 0 but not significantly greater than 1 in absolute magnitude at the .05 level, and (2) be less 
than 1.5 in absolute magnitude. Category C items are those with Mantel-Haenszel values that are 
significantly greater than 1 and larger than 1.5 in absolute magnitude. 
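
As a rough illustration of the statistic and the A/B/C rule just described, the sketch below computes the Mantel-Haenszel common odds ratio from counts at each level of the matching criterion and converts it to the delta scale with the standard transformation, -2.35 times the natural log of the odds ratio. It is not the ETS program. The function names, the requirement that a standard error be supplied by the caller, the two-sided critical value of 1.96, and the treatment of every significant item that misses the C criteria as B are assumptions made for the example.

```python
import math


def mh_d_dif(strata):
    """Mantel-Haenszel DIF estimate on the ETS delta scale.

    strata: iterable of (ref_right, ref_wrong, focal_right, focal_wrong)
    counts, one tuple per level of the matching criterion.
    Positive results mean the item is differentially easier for the
    focal group.
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # empty score level contributes nothing
        num += a * d / n
        den += b * c / n
    alpha = num / den  # common odds ratio, reference relative to focal
    return -2.35 * math.log(alpha)


def dif_category(d_dif, std_err, z_crit=1.96):
    """Assign A, B+, B-, C+, or C- from an estimate and its standard error."""
    size = abs(d_dif)
    sign = "+" if d_dif > 0 else "-"
    if size / std_err < z_crit:
        return "A"             # not significantly different from 0
    if size >= 1.5 and (size - 1.0) / std_err > z_crit:
        return "C" + sign      # significantly beyond 1 and at least 1.5
    return "B" + sign


# Hypothetical counts for one item at two matched score levels.
example = [(40, 10, 20, 30), (30, 20, 25, 25)]
print(round(mh_d_dif(example), 2), dif_category(-2.0, 0.3))
```

With these hypothetical counts the reference group has higher odds of a correct answer at both score levels, so the estimate is negative (differentially harder for the focal group).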

For each block of items at each grade a single set of analyses was carried out based on 
equal-sized random samples of data from all participating jurisdictions. Each set of analyses 
involved four reference group/focal group comparisons: male/female, White/Asian American, 
White/Black, and White/Hispanic. The first subgroup in each comparison is the reference 
group; the second subgroup is the focal group. 

All analyses used rescaled sampling weights. A separate rescaled weight was defined for 
each comparison as: 

Rescaled Weight = Original Weight x (Total Sample Size / Sum of the Weights) 

where the total sample size is the total number of students for the two groups being analyzed 
(e.g., for the White/Hispanic comparison, the total number of White and Hispanic examinees in 
the sample at that grade), and the sum of the weights is the sum of the sampling weights of all 
the students in the sample for the two groups being analyzed. Four rescaled weights were 
computed for White examinees — one for the gender comparison and three for the 
race/ethnicity comparisons. Two rescaled overall weights were computed for the Asian 
American, Black, and Hispanic examinees — one for the gender comparison and another for the 
appropriate race/ethnicity comparison. 
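
A minimal sketch of the rescaling described above: within one comparison, each student's weight is multiplied by the total sample size divided by the sum of the weights, so the rescaled weights add up to the number of students in the two groups. The function name and the example weights are hypothetical.

```python
def rescale_weights(weights):
    """Rescale sampling weights so that they sum to the sample size.

    weights: original sampling weights for all students in the two groups
    being compared (for example, White and Hispanic examinees).
    """
    total_sample_size = len(weights)
    sum_of_weights = sum(weights)
    return [w * total_sample_size / sum_of_weights for w in weights]


original = [10.0, 20.0, 30.0, 40.0]   # hypothetical sampling weights
rescaled = rescale_weights(original)
print(rescaled, sum(rescaled))
```

Because the pool of students changes from one comparison to the next, the same student receives a different rescaled weight in each comparison, which is why several rescaled weights are computed per examinee.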

The ETS generalized program IANA83 was used to carry out the DIF analyses. Two-sided 
modification³ was used. In the calculation of number-correct scores for the matching 
criterion, both not-reached and omitted items were considered as wrong responses. For each 
item, calculation of the Mantel-Haenszel statistic did not include data from examinees who did 
not reach the item in question. Because the Mantel-Haenszel procedure, as currently 
implemented, is appropriate only for dichotomously scored items, the extended 
constructed-response items had to be scored dichotomously for the DIF analyses. Extended constructed 
responses rated as "satisfactory" or "extended" were scored as correct; all other responses were 
scored as incorrect. 

At grade 4, 159 items were analyzed; at grade 8, 211 items were analyzed*. Items 
common to both grades underwent separate DIF analyses for each grade. Tables 9-7 and 9-8 
provide a summary of the results of the DIF analyses for the grade 4 and grade 8 collections of 
items grouped by content or skill area. For each grade, the tables provide six sets of five 
frequency distributions for the categorized Mantel-Haenszel statistics for the items in each of 
the scales. The leftmost frequency distribution gives the number (and percent) of items in each 
of five categories (C+, B+, A, B-, C-) based on the largest absolute DIF value obtained for the 
item across the four reference group/focal group comparisons that were carried out. The 
remaining four frequency distributions give the number of items with indices in each DIF 
category for each of the four reference group/focal group comparisons. 



³ Modification refers to the procedure in which items classified as "C" items in an initial DIF analysis are deleted from 
the matching criterion, and a second DIF analysis is run. Two-sided means that "C" items are deleted from the criterion, 
regardless of which group they favor. 

* Separate DIF indices were calculated for the individual component items of each testlet. No additional DIF analyses 
were carried out on the overall testlet score. 

Table 9-7

Frequency Distributions of DIF Statistics for Grade 4 Items Grouped by Content or Skill Area

              Category of         Number of Items in Category of DIF Value
              Maximum Absolute    for Each Comparison (Reference Group/Focal Group)
              DIF Value For
              All Comparisons

DIF Category* Number    Percent    Male/Female    White/Black    White/Hispanic    White/Asian Amer.

Numbers and Operations
C+                 0        0.0              0              0                 0                    0
B+                11       17.5              1              1                 7                    4
A                 43       68.3             62             58                51                   53
B-                 7       11.1              0              4                 3                    6
C-                 2        3.2              0              0                 2                    0

Measurement
C+                 0        0.0              0              0                 0                    0
B+                 4       13.8              0              0                 1                    3
A                 15       51.7             26             25                24                   20
B-                 6       20.7              3              2                 2                    3
C-                 4       13.8              0              2                 2                    3

Geometry
C+                 1        3.7              1              0                 0                    0
B+                 8       29.6              2              3                 0                    4
A                 13       48.1             23             22                25                   21
B-                 2        7.4              1              0                 2                    1
C-                 3       11.1              0              2                 0                    1

Data Analysis, Statistics, and Probability
C+                 2       10.0              1              0                 0                    1
B+                 0        0.0              0              0                 0                    1
A                 11       55.0             18             14                17                   14
B-                 2       10.0              1              4                 2                    0
C-                 5       25.0              0              2                 1                    4

Algebra and Functions
C+                 0        0.0              1              0                 0                    0
B+                 2       11.8              1              1                 0                    0
A                 11       64.7             14             14                16                   15
B-                 3       17.6              1              2                 0                    1
C-                 1        5.9              0              0                 1                    1

Estimation
C+                 1        5.0              0              1                 0                    0
B+                 3       15.0              0              2                 0                    1
A                 12       60.0             19             13                20                   19
B-                 3       15.0              1              3                 0                    0
C-                 1        5.0              0              1                 0                    0

* Categories are A, B, and C. (+) indicates items in the category that are differentially easier for the focal group;
(-) indicates items in the category that are differentially more difficult for the focal group.

Table 9-8

Frequency Distributions of DIF Statistics for Grade 8 Items Grouped by Content or Skill Area

              Category of         Number of Items in Category of DIF Value
              Maximum Absolute    for Each Comparison (Reference Group/Focal Group)
              DIF Value For
              All Comparisons

DIF Category* Number    Percent    Male/Female    White/Black    White/Hispanic    White/Asian Amer.

Numbers and Operations
C+                 6       10.3              1              1                 1                    4
B+                10       17.2              7              6                 1                    5
A                 25       43.1             46             46                49                   43
B-                13       22.4              4              2                 6                    6
C-                 4        6.9              0              3                 1                    0

Measurement
C+                 0        0.0              0              0                 0                    0
B+                 4       12.5              1              1                 1                    1
A                 17       53.1             29             26                28                   26
B-                 7       21.9              2              4                 2                    3
C-                 4       12.5              0              1                 1                    2

Geometry
C+                 3        8.3              0              1                 0                    2
B+                10       27.8              2              3                 6                    6
A                 14       38.9             30             29                28                   27
B-                 9       25.0              4              3                 2                    1
C-                 0        0.0              0              0                 0                    0

Data Analysis, Statistics, and Probability
C+                 1        3.6              0              0                 1                    0
B+                 3       10.7              1              2                 0                    1
A                 11       39.3             26             23                23                   15
B-                 6       21.4              1              1                 3                    7
C-                 7       25.0              0              2                 1                    5

Algebra and Functions
C+                 5       17.2              1              0                 0                    4
B+                 9       31.0              2              1                 0                    7
A                 10       34.5             26             24                27                   17
B-                 3       10.3              0              3                 1                    1
C-                 2        6.9              0              1                 1                    0

Estimation
C+                 0        0.0              0              0                 0                    0
B+                 2        9.1              0              1                 0                    1
A                 17       77.3             22             18                22                   20
B-                 2        9.1              0              2                 0                    1
C-                 1        4.5              0              1                 0                    0

* Categories are A, B, and C. (+) indicates items in the category that are differentially easier for the focal group;
(-) indicates items in the category that are differentially more difficult for the focal group.

A total of 20 items were classified as "C" items for at least one of the analyses for the 
fourth-grade Trial State Assessment data; 33 items were classified as "C" items for at least one 
of the analyses for the eighth-grade Trial State Assessment data. For the grade 4 items, 80 
percent of the "C" items (16 out of 20) were differentially more difficult for at least one of the 
four focal groups (female, Black, Hispanic, or Asian American examinees). Nine of these items 
covered topics in the content areas of Measurement and Data Analysis, Statistics, and 
Probability. The grade 8 "C" items were split about equally between those favoring the 
reference group and those favoring the focal group. A relatively large proportion of items (12 
of 28) covering topics in Data Analysis, Statistics, and Probability were differentially more 
difficult (B- or C-) for Asian American examinees. In contrast, differentially functioning items 
covering topics in Algebra and Functions, and, to a lesser extent, Geometry, tended to favor 
Asian American examinees. 

Following standard practice at ETS for DIF analyses conducted on final test forms, all 
"C" items were reviewed by a committee of trained test developers and subject-matter 
specialists. Such committees are charged with making judgments about whether or not the 
differential difficulty of an item is unfairly related to group membership. As pointed out by 
Zieky (1993): 

It is important to realize that DIF is not a synonym for bias. The item response 
theory based methods, as well as the Mantel-Haenszel and standardization 
methods of DIF detection, will identify questions that are not measuring the 
same dimension(s) as the bulk of the items in the matching criterion. ... Therefore, 
judgement is required to determine whether or not the difference in difficulty 
shown by a DIF index is unfairly related to group membership. The judgement of 
fairness is based on whether or not the difference in difficulty is believed to be 
related to the construct being measured. ... The fairness of an item depends 
directly on the purpose for which a test is being used. For example, a science 
item that is differentially difficult for women may be judged to be fair in a test 
designed for certification of science teachers because the item measures a topic 
that every entry-level science teacher should know. However, that same item, 
with the same DIF value, may be judged to be unfair in a test of general 
knowledge designed for all entry-level teachers. (p. 340) 

The committee assembled to review NAEP items included both ETS staff and outside members 
with expertise in the field. It was the committee's judgment that none of the "C" items for the 
national or Trial State Assessment data were functioning differentially due to factors irrelevant 
to test objectives. Hence, none of the items were removed from scaling due to differential item 
functioning. 



9.4 ITEM RESPONSE THEORY (IRT) SCALING 

Items at each grade were sorted into six distinct sets, one for each of the five 
mathematics content areas and one for estimation. Figure 9-2 contains stem-and-leaf displays of 
the average scores for the items comprising each of the six fourth-grade sets. Figure 9-3 
contains corresponding results for the eighth-grade item sets. The averages are based on the 
entire sample of students in the Trial State Assessment and use the same rescaled sampling 

Figure 9-2

Stem-and-leaf Display* of Average Scores for Items, by Scale, for Grade 4

NUMBERS AND OPERATIONS

N = 63, Median = 0.45, Quartiles = 0.3, 0.65
Decimal point is 1 place to the left of the colon

  1   1   0 : 9
  6   5   1 : 07889
 15   9   2 : 000246678
 24   9   3 : 002556667
     11   4 : 02333455677
 28   9   5 : 024457789
 19   9   6 : 011578899
 10   5   7 : 01444
  5   4   8 : 3889
  1   1   9 : 1

MEASUREMENT

N = 29, Median = 0.43, Quartiles = 0.33, 0.6
Decimal point is 1 place to the left of the colon

  1   1   0 : 5
  2   1   1 : 9
  6   4   2 : 0112
 12   6   3 : 035788
      7   4 : 1333459
 10   2   5 : 67
  8   2   6 : 03
  6   4   7 : 2368
  2   2   8 : 78

GEOMETRY

N = 27, Median = 0.4, Quartiles = 0.25, 0.65
Decimal point is 1 place to the left of the colon

  1   1   0 : 6
  4   3   1 : 134
  9   5   2 : 02569
 13   4   3 : 2247
      3   4 : 013
 11   4   5 : 2249
  7   4   6 : 5578
  3   1   7 : 7
  2   0   8 :
  2   2   9 : 01

DATA ANALYSIS, STATISTICS, AND PROBABILITY

N = 20, Median = 0.475, Quartiles = 0.27, 0.545
Decimal point is 1 place to the left of the colon

  1   1   1 : 1
  6   5   2 : 13668
  7   1   3 : 2
      5   4 : 14788
  8   6   5 : 233677
  2   2   6 : 34

ALGEBRA AND FUNCTIONS

N = 16, Median = 0.34, Quartiles = 0.27, 0.52
Decimal point is 1 place to the left of the colon

  1   1   1 : 8
  5   4   2 : 0559
      5   3 : 01267
  6   1   4 : 9
  5   3   5 : 047
  2   1   6 : 4
  1   0   7 :
  1   0   8 :
  1   1   9 : 0

ESTIMATION

N = 19, Median = 0.54, Quartiles = 0.43, 0.68
Decimal point is 1 place to the left of the colon

  1   1   2 : 5
  4   3   3 : 458
  7   3   4 : 358
      5   5 : 01446
  7   3   6 : 068
  4   1   7 : 2
  3   3   8 : 366

* The first column of numbers shows observation depths; the second column shows the number of observations;
the remainder of the figure contains the stem-and-leaf display.

Figure 9-3

Stem-and-leaf Display* of Average Scores for Items, by Scale, for Grade 8

NUMBERS AND OPERATIONS

N = 58, Median = 0.635, Quartiles = 0.42, 0.8
Decimal point is 1 place to the left of the colon

  1   1   0 : 8
  2   1   1 : 6
  7   5   2 : 11488
 12   5   3 : 01389
 18   6   4 : 002699
 24   6   5 : 012579
      9   6 : 001334699
 25   9   7 : 001256679
 16  10   8 : 0011233559
  6   6   9 : 011233

MEASUREMENT

N = 32, Median = 0.565, Quartiles = 0.265, 0.73
Decimal point is 1 place to the left of the colon

  2   2   0 : 88
  6   4   1 : 0289
  9   3   2 : 267
 13   4   3 : 2235
 14   1   4 : 7
      3   5 : 058
 15   5   6 : 11448
 10   5   7 : 23399
  5   5   8 : 03479

GEOMETRY

N = 36, Median = 0.5, Quartiles = 0.315, 0.695
Decimal point is 1 place to the left of the colon

  1   1   0 : 8
  1   0   1 :
  7   6   2 : 144999
 13   6   3 : 012469
 18   5   4 : 24589
 18   5   5 : 13789
 13   4   6 : 0089
  9   3   7 : 057
  6   5   8 : 02268
  1   1   9 : 2

DATA ANALYSIS, STATISTICS, AND PROBABILITY

N = 28, Median = 0.455, Quartiles = 0.215, 0.69
Decimal point is 1 place to the left of the colon

  1   1   0 : 9
  5   4   1 : 1256
  9   4   2 : 0129
 11   2   3 : 56
 14   3   4 : 568
 14   3   5 : 189
 11   4   6 : 1335
  7   4   7 : 3467
  3   3   8 : 799

ALGEBRA AND FUNCTIONS

N = 29, Median = 0.48, Quartiles = 0.32, 0.69
Decimal point is 1 place to the left of the colon

  1   1   0 : 8
  1   0   1 :
  6   5   2 : 00678
 10   4   3 : 1256
      5   4 : 12468
 14   5   5 : 00146
  9   4   6 : 5999
  5   2   7 : 58
  3   1   8 : 0
  2   2   9 : 56

ESTIMATION

N = 21, Median = 0.6, Quartiles = 0.41, 0.64
Decimal point is 1 place to the left of the colon

  2   2   2 : 58
  5   3   3 : 147
  8   3   4 : 168
 10   2   5 : 49
      7   6 : 0034449
  4   2   7 : 35
  2   2   8 : 66

* The first column of numbers shows observation depths; the second column shows the number of observations;
the remainder of the figure contains the stem-and-leaf display.

weights described in the previous section. As a whole, both the fourth- and the eighth-grade 
students found the set of Algebra and Functions items to be the most difficult. The fourth 
graders found the set of Estimation items the easiest, while the eighth-grade students found the 
set of Numbers and Operations items to be easiest. 

Separate IRT-based scales corresponding to each of the item sets defined above were 
developed using the scaling models described in Chapter 8. For each grade, six scales were 
produced by separately calibrating the sets of items classified in each of the five content areas 
and the items in the estimation block. Since there were two grades and each had six scales, a 
total of 12 distinct calibrations were carried out. 

For the reasons discussed in Mazzeo (1991), for each scale at each grade, a single set of 
item parameters for each item was estimated and used for all jurisdictions. Item parameter 
estimation was carried out using a 25 percent systematic random sample of the students 
participating in the 1992 Trial State Assessment at each grade and included equal numbers of 
students from each participating jurisdiction, half from monitored sessions and half from 
unmonitored sessions. For the fourth-grade calibrations, the sample consisted of 27,720 
students, with 630 students being sampled from each of the 44 participating jurisdictions. For 
the eighth-grade calibrations, the sample consisted of 27,016 students, with 614 students being 
sampled from each jurisdiction. As was done for 1990, all calibrations were carried out using 
the rescaled sampling weights described earlier in an effort to ensure that each jurisdiction’s 
data contributed equally to the determination of the item parameter estimates. 
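
The sample sizes quoted above follow from the design: equal numbers per jurisdiction, split evenly between session types. The sketch below checks that arithmetic and shows one common way to draw a systematic sample (a random start followed by a fixed interval). The report does not describe its selection routine, so `systematic_sample`, the pool of 1,250 students per session type, and the seed are all assumptions made for illustration.

```python
import random


def systematic_sample(units, k, rng):
    """Select k units at a fixed interval after a random start."""
    step = len(units) / k
    start = rng.random() * step
    last = len(units) - 1
    return [units[min(int(start + i * step), last)] for i in range(k)]


JURISDICTIONS = 44
PER_JURISDICTION = {4: 630, 8: 614}   # students per jurisdiction, by grade

for grade, n in PER_JURISDICTION.items():
    print(grade, n // 2, "per session type;", n * JURISDICTIONS, "in total")

# Hypothetical pool: 1,250 monitored students in one jurisdiction.
rng = random.Random(1992)
monitored = list(range(1250))
picked = systematic_sample(monitored, PER_JURISDICTION[4] // 2, rng)
print(len(picked))
```

The totals printed for the two grades are 27,720 and 27,016, the calibration sample sizes reported in the text.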

As mentioned above, the sample used for item calibration was also constrained to 
contain an equal number of students from the monitored and unmonitored sessions from each 
of the participating jurisdictions. To the extent that items may have functioned differently in 
monitored and unmonitored sessions, the single set of item parameters obtained defines a sort of 
average item characteristic curve for the two types of sessions. Tables 9-5 and 9-6 (shown 
earlier) presented block-level item statistics that suggested little, if any, differences in item 
functioning by session type. Figures 9-4 and 9-5 present the results of supplementary analyses 
organized by scale. 

Figures 9-4 (for grade 4) and 9-5 (for grade 8) contain plots of differences in score 
means (monitored minus unmonitored) against the score means for the monitored sessions for 
the items in each of the six scales. At grade 4, the differences between session types appear 
small on all scales, with a slight tendency for performance to be higher in the monitored 
sessions. At grade 8, for all but one scale, the average scores were quite similar for the two 
types of sessions with a tendency for performance to be slightly higher in the unmonitored 
sessions. For the eighth-grade Estimation scale, however, average score means were consistently 
higher in the unmonitored sessions than in the monitored sessions for almost all of the items. 
Although the average difference over all items is small (.002), the item-by-item consistency of 
the results suggests that departures from standardized administration procedures in the 
unmonitored sessions may have occurred in one or more of the jurisdictions. 

Figure 9-4

Differences in Average Item Scores (Monitored Minus Unmonitored) 
Plotted Against Monitored Average Item Scores, Grade 4 

Figure 9-5 

Differences in Average Item Scores (Monitored Minus Unmonitored) 
Plotted Against Monitored Average Item Scores, Grade 8 











Figure 9-6 contains a stem-and-leaf display of the differences between monitored and 
unmonitored sessions averaged over items in the eighth-grade Estimation scale for each of the 
participating jurisdictions. The average item scores were not uniformly higher in unmonitored 
sessions for all jurisdictions, and the median difference was essentially zero (-.0019). In addition, 
the magnitudes of the differences are all within a reasonable range given the expected variation 
due to sampling. However, there does appear to be a slight tendency toward higher performance 
by unmonitored sessions on the estimation items. For 25 of the 44 jurisdictions, the differences 
were negative; for 14, the mean for the unmonitored sessions exceeded that of the monitored 
sessions by more than .01. A difference of this magnitude in the reverse direction occurred in 
only 5 jurisdictions. One of the negative differences, -.031 in Florida, is somewhat large 
compared with the magnitude of the differences observed in the other jurisdictions. The reasons 
for this general tendency toward higher performance by unmonitored sessions, and in particular 
for the somewhat larger difference observed in Florida, cannot be determined. 



9.4.1 Item Parameter Estimation 

For each grade and each subscale, item parameter estimates were obtained by the NAEP 
BILOG/PARSCALE program, which combines Mislevy and Bock's (1982) BILOG and Muraki 
and Bock's (1991) PARSCALE computer programs. The program uses marginal estimation 
procedures to estimate the parameters of the one-, two-, and three-parameter logistic models, 
and the generalized partial credit model described by Muraki (1992). 

All multiple-choice items were dichotomously scored and were scaled using the three- 
parameter logistic model. Omitted responses to multiple-choice items were treated as 
fractionally correct, with the fraction being set to 1 over the number of response options. All 
short constructed-response items were dichotomously scored and were scaled using the 
two-parameter logistic model. Omitted responses to short constructed-response items were treated 
as incorrect. 
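
The scoring rules above can be illustrated with a short sketch. The logistic forms are written in their common textbook parameterization with a scaling constant of 1.7; the report defers the exact model to Chapter 8, so that constant and all function names here are assumptions rather than the BILOG/PARSCALE implementation.

```python
import math

D = 1.7  # scaling constant commonly used with logistic IRT models (assumed)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_2pl(theta, a, b):
    """Two-parameter model: the 3PL with the lower asymptote fixed at 0."""
    return p_3pl(theta, a, b, 0.0)


def score_multiple_choice(response, key, n_options):
    """Score a multiple-choice item; None marks an omitted response."""
    if response is None:
        return 1.0 / n_options   # omits are treated as fractionally correct
    return 1.0 if response == key else 0.0


def score_short_constructed(is_correct):
    """Score a short constructed-response item; None marks an omit."""
    return 1.0 if is_correct else 0.0   # omits are treated as incorrect


print(p_3pl(0.0, 1.0, 0.0, 0.2), score_multiple_choice(None, "B", 4))
```

An omitted four-option multiple-choice item therefore contributes a score of 0.25, while an omitted short constructed-response item contributes 0.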

A key assumption associated with IRT scales is that of conditional independence. 
Conditional on proficiency, examinee item responses are assumed to be independent. When 
sets of items are logically dependent on each other, or are based on a single stimulus, this 
assumption can be violated to a degree that results in aberrant scaling results. In order to avoid 
possible problems with interitem dependencies, 4 testlets (2 at each grade) were created by 
combining examinee responses to sets of related items into a single score for each set. At grade 
4, one 3-item and one 2-item testlet were created; at grade 8 two 4-item testlets were created. 
The testlets, rather than their original constituent items, were used in scaling the 1992 
mathematics assessment. In all cases, examinee testlet scores were defined as the number of 
correct responses given to each testlet’s constituent items. Examinees omitting all constituents 
of the testlet were placed in the "zero correct" category of the testlet. Examinees classified as 
"not reaching" all constituent parts were treated as having not been presented the testlet. All 
testlets were scaled using the generalized partial credit model. 

Figure 9-6

Stem-and-leaf Display* of State-by-state Differences in Average Item Score
(Monitored - Unmonitored) for the Grade 8 Estimation Item Pool

N = 44, Median = -0.0019, Quartiles = -0.012, 0.005
Decimal point is 2 places to the left of the colon

  1   1   -3 : 1
  1   0   -2 :
  4   3   -2 : 100
  9   5   -1 : 97775
 14   5   -1 : 32210
 17   3   -0 : 875
      8   -0 : 44432221
 19   6    0 : 011123
 13   8    0 : 55567889
  5   3    1 : 033
  2   2    1 : 68

* The first column of numbers shows observation depths; the second column shows the number of observations;
the remainder of the figure contains the stem-and-leaf display.

There was a total of 11 extended constructed-response items at grades 4 and 8. Each of 
these items was also scaled using the generalized partial credit model. Five scoring levels were 
defined: 

0 Wrong, off-task, or omitted 

1 Minimal response 

2 Partially correct 

3 Satisfactory response 

4 Elaborated response 

Table 9-9 provides a listing of the blocks, content area classifications, and NAEP identification 
numbers for all extended constructed-response items included in the 1992 assessment. 



Table 9-9

Extended Constructed-response Items*
1992 Trial State Assessment in Mathematics

Grade      Block    Scale                            NAEP ID

Grade 4    M7       Numbers and Operations           M045401
           M9       Geometry                         M041201
           M13      Algebra and Functions            M043501
           M14      Numbers and Operations           M044401
           M15      Data Analysis, Stat., & Prob.    M049Q01

Grade 8    M3       Numbers and Operations           M051101
           M7       Geometry                         M045901
           M9       Data Analysis, Stat., & Prob.    M053101
           M12      Algebra and Functions            M054301
           M13      Measurement                      M052201
           M14      Numbers and Operations           M055581

* These items always appeared last in their respective blocks. The number of
items in each block is shown in Tables 9-1 and 9-2.

Bayes modal estimates of all item parameters were obtained from the 
BILOG/PARSCALE program. Prior distributions were imposed on item parameters with the 
following starting values: thresholds (normal [0,2]); slopes (log-normal [0,5]); and asymptotes 
(two-parameter beta with parameter values determined as functions of the number of response 
options for an item and a weight factor of 50). The locations (but not the dispersions) were 
updated at each program estimation cycle in accordance with provisional estimates of the item 
parameters. 




As was done for the 1990 Trial State Assessment, item parameter estimation proceeded 
in two phases. First, the subject ability distribution was assumed fixed (normal [0,1]) and a 
stable solution was obtained. The parameter estimates from this solution were then used as 
starting values for a subsequent set of runs in which the subject ability distribution was freed and 
estimated concurrently with item parameter estimates. After each estimation cycle, the subject 
ability distribution was re-standardized to have a mean of zero and standard deviation of one. 
Correspondingly, parameter estimates for that cycle were also linearly re-standardized. During 
the concurrent estimation phase, convergence problems were encountered for the grade 4 Data 
Analysis, Statistics, and Probability scale. Therefore, for this scale, the converged normal 
solution results were used. 
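
The linear re-standardization mentioned above can be sketched for the logistic models. If the ability distribution has mean m and standard deviation s, rescaling ability to (theta - m) / s leaves every response probability unchanged provided each slope is multiplied by s and each threshold is shifted and scaled in the same way as ability. The function name and the dictionary layout are hypothetical, and the sketch covers only the slope, threshold, and asymptote of dichotomous items.

```python
def restandardize(mean, sd, items):
    """Re-express item parameters after ability is rescaled to mean 0, SD 1.

    items: list of dicts with slope 'a', threshold 'b', and asymptote 'c'
    on the old scale. Ability on the new scale is (theta - mean) / sd.
    """
    return [
        {"a": it["a"] * sd, "b": (it["b"] - mean) / sd, "c": it["c"]}
        for it in items
    ]


old_items = [{"a": 1.0, "b": 0.5, "c": 0.2}]      # hypothetical estimates
new_items = restandardize(0.2, 1.1, old_items)

# The product a * (theta - b) is the same on either scale.
theta_old = 0.9
theta_new = (theta_old - 0.2) / 1.1
print(old_items[0]["a"] * (theta_old - old_items[0]["b"]),
      new_items[0]["a"] * (theta_new - new_items[0]["b"]))
```

Because the quantity a * (theta - b) is preserved, the fitted item characteristic curves are identical before and after the transformation; only the units of the scale change.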

During and subsequent to item parameter estimation, evaluations of the fit of the IRT 
models were carried out for each of the items in the grade 4 and grade 8 item pools. These 
evaluations were conducted to determine the final composition of the item pool making up the 
scales by identifying misfitting items that could not be included. Evaluations of model fit were 
based primarily on a graphical analysis. For binary-scored items, model fit was evaluated by 
examining plots of estimates of the expected conditional (on 0) probability of a correct response 
that do not assume a two-parameter or three-parameter logistic model versus the probability 
predicted by the estimated item characteristic curve (see Mislevy & Sheehan, 1987, p. 302). For 
the testlets and extended constructed-response items, similar plots were produced for each item 
category characteristic curve. 

As with most procedures that involve evaluating plots of data versus model predictions, a
certain degree of subjectivity is involved in determining the degree of fit necessary to justify use
of the model. There are a number of reasons why evaluation of model fit relied primarily on
analyses of plots rather than seemingly more objective procedures based on goodness-of-fit
indices such as the "pseudo chi-squares" produced in BILOG (Mislevy & Bock, 1982). First, the
exact sampling distributions of these indices when the model fits are not well understood, even
for fairly long tests. Mislevy and Stocking (1987) point out that the usefulness of these indices
appears particularly limited in situations like NAEP where examinees have been administered
relatively short tests. Work in progress by Stone, Mislevy, and Mazzeo using simulated data
suggests that the correct reference chi-square distributions for these indices have considerably
fewer degrees of freedom than the value indicated by the BILOG/PARSCALE program and
require additional adjustments of scale. However, it is not yet clear how to estimate the correct
number of degrees of freedom and the necessary scale adjustment factors. Consequently,
pseudo chi-square goodness-of-fit indices are used only as rough guides in interpreting the
severity of model departures.

Second, as discussed in Chapter 8, it is almost certainly the case that, for most items, 
item-response models hold only to a certain degree of approximation. Given the large sample
sizes used in NAEP and the Trial State Assessment, there will be sets of items for which one is
almost certain to reject the hypothesis that the model fits the data even though departures are 
minimal in nature or involve kinds of misfit unlikely to impact on important model-based 
inferences. In practice, one is almost always forced to temper statistical decisions with 
judgments about the severity of model misfit and the potential impact of such misfit on final 
results. 
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In making decisions about excluding items from the final scales, a balance was sought 
between being too stringent, hence deleting too many items and possibly damaging the content 
representativeness of the pool of scaled items, and too lenient, hence including items with model 
fit poor enough to invalidate the types of model-based inferences made from NAEP results. 
Items that clearly did not fit the model were not included in the final scales; however, a certain
degree of misfit was tolerated for a number of items included in the final scales.

For the large majority of the grade 4 and grade 8 items, the fit of the model was
extremely good. Figure 9-7 provides a typical example of what the plots look like for this class
of items. The plots that are shown are for items from the grade 8 Algebra and Functions scale.
The item at the top of the plot is a binary-scored constructed-response item; the item at the
bottom of the plot is a multiple-choice item. In each plot, the y-axis indicates the probability of
a correct response and the x-axis indicates proficiency level (theta). The circles show estimates
of the conditional (on theta) probability of a correct response that do not assume a logistic form
(referred to subsequently as nonlogistic-based estimates). The sizes of the circles are
proportional to the estimated density of the theta distribution at the indicated value. The solid
line shows the estimated item response function. The item response function provides estimates
of the conditional probability of a correct response based on an assumed logistic form. The
vertical dashed line indicates the estimated location parameter (b) for the item and the
horizontal dashed line (bottom plot only) indicates the estimated lower asymptote (c). Also
shown in the plot are the actual values of the item parameter estimates (lower right-hand
corner) as well as the proportion of students that answered the item correctly (upper left-hand
corner). As is evident from the plots, the nonlogistic-based estimates of conditional probabilities
are in extremely close agreement with those given by the estimated item response function.
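
The comparison shown in these plots can be sketched with simulated data. The example below assumes a three-parameter logistic item response function with a 1.7 scaling constant and uses simple binned proportions correct as a stand-in for the nonlogistic-based estimates; the binning is an illustrative simplification and is not the estimator of Mislevy and Sheehan (1987). All parameter values, the sample size, and the function names are hypothetical.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def irf_3pl(theta, a, b, c):
    """Model-based probability of a correct response (logistic form)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def binned_proportions(theta, correct, edges):
    """Proportion correct within theta intervals; makes no logistic assumption."""
    idx = np.digitize(theta, edges) - 1  # interval index; out-of-range values are skipped
    centers, props, counts = [], [], []
    for k in range(len(edges) - 1):
        in_bin = idx == k
        if in_bin.any():
            centers.append(0.5 * (edges[k] + edges[k + 1]))
            props.append(correct[in_bin].mean())
            counts.append(int(in_bin.sum()))
    return np.array(centers), np.array(props), np.array(counts)


# Simulated examinees and one hypothetical multiple-choice item.
rng = np.random.default_rng(0)
theta = rng.normal(size=20000)
a, b, c = 1.0, 0.2, 0.2
correct = rng.random(theta.size) < irf_3pl(theta, a, b, c)

edges = np.linspace(-3.0, 3.0, 13)
centers, props, counts = binned_proportions(theta, correct, edges)

# Largest gap between the two sets of estimates where data are plentiful.
well_populated = counts >= 200
max_gap = np.max(np.abs(props - irf_3pl(centers, a, b, c))[well_populated])
```

The counts play the role of the circle sizes in the figures: intervals in the tails hold few examinees, so disagreement there carries less weight than disagreement near the center of the theta distribution.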

Figure 9-8 provides an example of a plot for a five-category extended constructed-response
item exhibiting good model fit. Like the plots for the binary items, this plot shows two
estimates of each item category characteristic curve, one set that does not assume the partial
credit model (shown as circles) and one that does (the solid lines). The dashed horizontal lines
show the location of the estimated category thresholds for the item (d1 to d4; see Chapter 8,
section 8.3.1). The estimates for all parameters for the item in question are also indicated on
the plot. As with Figure 9-7, the two sets of estimates agree quite well, although there is a slight
tendency for the nonlogistic-based estimates for category two to be somewhat higher than the
model-based estimates for theta values greater than 1. An aspect of Figure 9-8 worth noting is
the large proportion of examinees that responded in the two lowest response categories for this
item.⁵ Such results were typical for the extended constructed-response items at both grades.
Substantial proportions of examinees were either unable or unwilling to provide even minimally
adequate answers to such items.
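
For readers who want to see how a set of item category characteristic curves arises from one slope, one location, and four category thresholds, the following sketch may help. It assumes a generalized partial credit parameterization in which the step for category k contributes 1.7a(theta - b + d_k); this sign convention, the scaling constant, and the parameter values are assumptions made for illustration and should be checked against the model definition in Chapter 8.

```python
import numpy as np

D = 1.7  # assumed scaling constant


def category_probabilities(theta, a, b, d):
    """Category probabilities for one polytomous item.

    d holds the thresholds d_1..d_m; the lowest category contributes a zero term.
    Returns an array with one row per theta value and m + 1 columns.
    """
    theta = np.atleast_1d(np.asarray(theta, float))
    steps = D * a * (theta[:, None] - b + np.asarray(d, float)[None, :])
    z = np.concatenate([np.zeros((theta.size, 1)), np.cumsum(steps, axis=1)], axis=1)
    z -= z.max(axis=1, keepdims=True)  # guard against overflow in exp
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)


# Hypothetical five-category item evaluated on a coarse theta grid.
theta_grid = np.linspace(-3.0, 3.0, 7)
probs = category_probabilities(theta_grid, a=0.7, b=0.8, d=[1.2, 0.4, -0.5, -1.1])
expected_score = probs @ np.arange(probs.shape[1])
```

Each column of `probs` is one category characteristic curve. With a fairly high location such as the hypothetical b used here, most of the probability at typical theta values falls in the lowest categories, which is the pattern described for the extended constructed-response items.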

As discussed above, some of the items retained for the final scales display some degree 
of model misfit. Figures 9-9 (binary-scored items) and 9-10 (extended constructed-response 
item) provide typical examples of such items. In general, good agreement between nonlogistic 
and logistic estimates of conditional probabilities was found for the regions of the theta scale



⁵ This is evidenced by the relatively large size of the circles indicating estimated conditional probabilities for these two
categories.
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Figure 9-7 




Plots* Comparing Empirical and Model-based Estimates of Item Response Functions
for Binary-scored Items Exhibiting Good Model Fit



M050801




M 0 1 S 3 0 1 




* Circles indicate estimated conditional probabilities obtained without assuming a logistic form; solid
line indicates estimated item response function assuming a logistic form.
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Figure 9-8 



Plot* Comparing Empirical and Model-based Estimates of Item Category Characteristic Curves
for a Polytomously Scored Item Exhibiting Good Model Fit



M053101



THETA



* Circles indicate estimated conditional probabilities obtained without assuming a logistic form; solid
line indicates estimated item response function assuming a logistic form.
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Figure 9-9 



Plots* Comparing Empirical and Model-based Estimates of Item Response Functions 
for Binary-scored Items Exhibiting Some Model Misfit 



M020501




M039901




* Circles indicate estimated conditional probabilities obtained without assuming a logistic form; solid 
line indicates estimated item response function assuming a logistic form. 
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Figure 9-10 



Plot* Comparing Empirical and Model-based Estimates of Item Category Characteristic Curves 
for a Polytomously Scored Item Exhibiting Some Model Misfit 



M041201




* Circles indicate estimated conditional probabilities obtained without assuming a logistic form; solid
line indicates estimated item response function assuming a logistic form.
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with theta values in the tails of the subject ability distributions. As noted in Chapter 5, two of 
the extended constructed-response items, one at grade 4 (see Figure 9-10) and one at grade 8,
had interreader reliabilities somewhat lower than those of the remaining items. Both of these
items did exhibit some degree of model misfit. However, the primary effect of lower interreader
reliability is to increase the imprecision of measurement rather than to bias the results. 

Only two of the administered items (one from the grade 4 estimation block and one 
from the grade 8 estimation block) were not included in the final scales. Plots for these items 
are given in Figure 9-11. As is evident from the nonlogistic-based estimates in the plots, both
items appear to have nonmonotonic item characteristic curves. Students with higher levels of
proficiency exhibit lower chances of success than do students with lower proficiency.
Logistic-based estimates of conditional probabilities are, by definition, monotonically increasing. Hence,
the model does not fit. 
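
The argument can be made concrete with a short check. The sketch below assumes a three-parameter logistic curve with a positive slope and a 1.7 scaling constant, and contrasts it with a made-up set of empirical proportions that rise and then fall; the proportions are hypothetical and are not the values plotted in Figure 9-11.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def irf_3pl(theta, a, b, c):
    """Logistic item response function; increasing in theta whenever a > 0."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def first_decrease(props, tol=0.0):
    """Index of the first step at which the curve drops by more than tol, or None."""
    diffs = np.diff(np.asarray(props, float))
    drops = np.flatnonzero(diffs < -tol)
    return int(drops[0]) if drops.size else None


theta_grid = np.linspace(-3.0, 3.0, 13)
model_curve = irf_3pl(theta_grid, a=0.9, b=0.0, c=0.2)

# Hypothetical empirical proportions correct for a nonmonotonic item.
hypothetical_empirical = np.array(
    [0.35, 0.38, 0.42, 0.47, 0.52, 0.55, 0.54, 0.50, 0.45, 0.41, 0.38, 0.36, 0.35]
)

model_drop = first_decrease(model_curve)                # None: never decreases
empirical_drop = first_decrease(hypothetical_empirical)
```

No choice of a, b, and c with a positive slope can reproduce a curve that turns downward, so an item of this kind cannot be brought into agreement with the model by re-estimating its parameters.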

Table 9-10 lists the items that received special treatment during the scaling process.
Included in the table are the block locations and item numbers for the items that were combined 
into testlets as well as for those that were excluded from the final scales. These items received 
identical special treatment in the production of the 1992 national scales. No other items in 
either assessment received special treatment. The IRT parameters for the items included in the 
Trial State Assessment are listed in Appendix D. 



Table 9-10 

Items from the 1992 Trial State Assessment in Mathematics Receiving Special Treatment 





Item      Grade  Block  Order in Block  Content Area                Treatment

M040401   Hi     M9     2a              Measurement                 Combined to form MM04451   Local dependencies across items
M040402                 2b
M040403                 2c

M044201   Qj     M14    ?               Algebra                     Combined to form M044261   Local dependencies across items
M044202                 8

M050201   s      M3                     Data Analysis, Statistics,  Combined to form M050261   Local dependencies across items
M050202                                 and Probability
M050203
M050204                 4d

M045801   s      M7     12a             Data Analysis, Statistics,  Combined to form M045861   Local dependencies across items
M045802                 12b             and Probability
M045803                 12c
M045804                 12d

M032101   4,8    M16    2               Estimation                  Not scaled - grade 8 only  Nonmonotonic IRF in 90 & 92

M032801   4 *    M16    9               Estimation                  Not scaled - grade 4 only  Nonmonotonic IRF in 90 & 92



9.5 ESTIMATION OF STATE AND SUBGROUP PROFICIENCY DISTRIBUTIONS 

The proficiency distributions in each state (and for important subgroups within each 
state) were estimated by using the multivariate plausible values methodology and the 
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Figure 9-11 

Plots* Comparing Empirical and Model-based Estimates of Item Response Functions 
for Items Dropped from Scaling Due to Model Misfit 



M032801




M032101



* Circles indicate estimated conditional probabilities obtained without assuming a logistic form; solid 
line indicates estimated item response function assuming a logistic form. 










corresponding MGROUP computer program (described in Chapter 8; see also Mislevy, 1991). 
The MGROUP program (Sheehan, 1985; Rogers, 1991), which was originally based on the 
procedures described by Mislevy and Sheehan (1987), was used in the 1990 Trial State 
Assessment of mathematics. The 1992 Trial State Assessment used an enhanced version of 
MGROUP, based on modifications described by Thomas (1992), to estimate the proficiency 
distribution for both the fourth and the eighth grades in each state. As described in the 
previous chapter, MGROUP estimates proficiency distributions using information from students’
item responses, student background variables, and the item parameter estimates obtained from
the BILOG/PARSCALE program.

The enhancements in the 1992 version of MGROUP included the replacement
of Monte Carlo integration by analytic calculations, new methods for computing student-level
posterior means and variances, and the generation of theta values from their posterior distributions
for the imputation of student proficiency values. Simulation studies indicate that the enhanced
MGROUP produces more accurate estimates of subscale variances and correlations (Thomas,
1992) than did the previous versions of MGROUP.

For the reasons discussed in Mazzeo (1991), separate conditioning models were 
estimated at each grade for each jurisdiction. This resulted in the estimation of 88 distinct 
conditioning models. At each grade, the background variables included in each jurisdiction’s 
model (denoted y in Chapter 8) were principal component scores derived from the within-state 
correlation matrix of selected main-effects and two-way interactions associated with a wide range 
of student, teacher, school, and community variables. A set of five multivariate plausible values 
was drawn for each student who participated in the Trial State Assessment. 

As was the case in 1990, plans for reporting each jurisdiction’s results required analyses 
examining the relationships between proficiencies and a large number of background variables. 
The background variables included student demographic characteristics (e.g., the race/ethnicity
of the student, highest level of education attained by parents), students’ perceptions about
mathematics, student behavior both in and out of school (e.g., amount of TV watched daily, 
amount of mathematics homework done each day), the type of mathematics class being taken 
(e.g., algebra or general fourth- or eighth-grade mathematics), the amount of emphasis on 
various topics included in the assessment provided by the students’ teachers, and a variety of 
other aspects of the students’ background and preparation, the background and preparation of 
their teachers, and the educational, social, and financial environment of the schools they 
attended. 

As described in the previous chapter, to avoid biases in reporting results and to minimize 
biases in secondary analyses, it is desirable to incorporate measures of a large number of
independent variables in the conditioning model. When expressed in terms of contrast-coded 
main effects and interactions, the number of variables to be included totaled 258 at grade 4 and 
303 at grade 8. Appendix C provides a listing of the full set of contrasts defined at each grade. 
These contrasts were the common starting point in the development of the conditioning models 
for each of the participating jurisdictions. 

Because of the large number of these contrasts and the fact that, within each jurisdiction, 
some contrasts had zero variance, some involved relatively small numbers of individuals, and 
some were highly correlated with other contrasts or sets of contrasts, an effort was made to 
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reduce the dimensionality of the predictor variables in each jurisdiction’s MGROUP models. As 
was done for the 1990 Trial State Assessment, the original background variable contrasts were 
standardized and transformed into a set of linearly independent variables by extracting separate 
sets of principal components (one set for each grade for each of the 44 jurisdictions) from the 
within-jurisdiction correlation matrices of the original contrast variables. The principal 
components, rather than the original variables, were used as the independent variables in the 
conditioning model. As was done for the 1990 Trial State Assessment, the number of principal
components included for each state was the number required to account for approximately 90
percent of the variance in the original contrast variables. Research based on data from the 1990
Trial State Assessment suggests that results obtained using such a subset of the components will
differ only slightly from those obtained using the full set (Mazzeo, Johnson, Bowker, & Fong,
1992).
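
The dimension-reduction step can be sketched as follows. The example is a minimal illustration under stated assumptions: the function name is hypothetical, the contrast matrix is simulated, sampling weights are ignored, and contrasts with zero variance are simply dropped before the correlation matrix is formed. It is not the procedure as implemented for the Trial State Assessment.

```python
import numpy as np


def principal_component_scores(contrasts, target=0.90):
    """Standardize contrasts and keep the leading components of their correlation matrix.

    Returns the component scores, the number of components kept, and the
    proportion of variance those components account for.
    """
    X = np.asarray(contrasts, float)
    sd = X.std(axis=0)
    keep = sd > 0  # contrasts with zero variance carry no information
    Z = (X[:, keep] - X[:, keep].mean(axis=0)) / sd[keep]
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    n_comp = int(np.searchsorted(cumulative, target) + 1)
    return Z @ eigvecs[:, :n_comp], n_comp, float(cumulative[n_comp - 1])


# Simulated contrasts for one jurisdiction: correlated columns plus a constant one.
rng = np.random.default_rng(1)
base = rng.normal(size=(500, 4))
mixed = base @ rng.normal(size=(4, 8)) + 0.3 * rng.normal(size=(500, 8))
contrasts = np.hstack([base, mixed, np.ones((500, 1))])

scores, n_comp, explained = principal_component_scores(contrasts)
```

The retained scores are mutually uncorrelated, which removes the collinearity among the original contrasts, and their number varies with the data, which is why the counts reported for the jurisdictions differ.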



Tables 9-11 (for grade 4) and 9-12 (for grade 8) contain a listing of the number of 
principal components included in and the proportion of proficiency variance accounted for by 
the conditioning model for each of the 44 participating jurisdictions. It is important to note that 
the proportion of variance accounted for by the conditioning model differs across scales within a 
state, across grades within a state, and across states within a scale. Such variability is not 
unexpected for at least two reasons. First, there is no reason to expect the strength of the 
relationship between proficiency and demographics to be identical across all grades and states. 

In fact, one of the reasons for fitting separate conditioning models is that the strength and 
nature of this relationship may differ across states. Second, the homogeneity of the 
demographic profile also differs across states. As with any correlational analysis, the restriction 
of the range in the predictor variables will attenuate the relationship. 

Figures 9-12 (for grade 4) and 9-13 (for grade 8) provide boxplots (Tukey, 1977) of the 
estimated within-jurisdiction correlations among the six scales. One boxplot is provided for each 
of the 15 unique scale pairs and each boxplot is based on 44 data points (i.e., the estimates of
the indicated correlations from each of the 44 participating jurisdictions). The plotted values, 
taken directly from the revised MGROUP program, are estimates of the within-jurisdiction 
correlations conditional on the set of principal components included in the conditioning model. The 
box portion shows the locations of the 25th, 50th, and 75th percentiles. Generally, the whiskers 
extend to the minimum and maximum values. However, values more than 1.5 interquartile 
ranges from the median are plotted as individual points. 

The number and nature of the scales that were produced were consistent with the 
recommendations for reporting that were given by the National Assessment Planning Project 
(see Chapter 2). Reporting results on multiple scales is typically most informative when each of 
the scales provides unique information about the profile of knowledge and skills possessed by 
the students being assessed. In such cases, one would hope to see relatively low correlations 
among the subscales. However, with a couple of exceptions, the correlations among the 1992 
mathematics scales are high across all jurisdictions, almost always exceeding .7 and quite often 
exceeding .9. This is particularly noteworthy when one considers that these are correlations 
conditional on a rather large set of background variables. The marginal correlations between 
subscales would be higher, particularly for those correlations in the .7 to .8 range. In particular, 
the correlations among three of the scales (Numbers and Operations; Data Analysis, Statistics, 
and Probability; and Algebra and Functions) are extremely high (rarely falling below .9) at both
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Table 9-11 

Proportion of Proficiency Variance Accounted for by Grade 4 Conditioning Models 



                      Number of       Numbers                                         Data Analysis,  Algebra
                      Principal       and                                             Statistics and  and
State                 Components      Operations      Measurement     Geometry        Probability     Functions       Estimation

Alabama               126             0.60            0.66            052             0.64            0.66            051
Arizona               129             0.57            055             0.43            058             057             059
Arkansas              129             OJ55            059             053             0.61            057             0.64
California            128             059             0.63            050             0.67            0.66            0.64
Colorado              128             0.49            051             0.41            053             057             058
Connecticut           126             059             0.63            0.49            0.64            0.68            0.68
Delaware              122             0.65            0.67            057             0.71            0.72            0.67
District of Columbia  128             0.57            05?             051             0.62            058             0.63
Florida               131             0.58            057             0.45            0.62            0.63            0.64
Georgia               129             0.66            0.67            055             0.68            0.68            0.66
Guam                  106             0.66            0.64            0.64            0.73            0.65            0.65
Hawaii                129             055             056             0.47            058             0.61            053
Idaho                 127             055             030             0.13            0.44            031             058
Indiana               124             052             051             032             051             059             0.63
Iowa                  123             056             057             055             059             0.63            0.63
Kentucky              131             057             056             032             0.62            0.60            056
Louisiana             130             059             0.64            0.62            0.67            0.65            0.67
Maine                 122             050             056             039             0.60            0.64            055
Maryland              129             0.69            0.67            0.70            0.70            0.62            0.78
Massachusetts         127             050             0.61            038             059             059             058
Michigan              123             0.67            0.62            0.62            0.62            055             0.71
Minnesota             120             0.61            0.61            052             0.66            0.66            0.60
Mississippi           129             055             052             0.46            0.68            059             0.63
Missouri              126             052             0.63            0.40            0.48            054             0.61
Nebraska              124             055             0.61            051             0.66            0.69            056
New Hampshire         126             051             0.49            029             059             053             054
New Jersey            125             0.60            0.66            058             056             0.61            0.68
New Mexico            125             0.64            054             050             0.62            056             0.64
New York              126             053             0.68            059             0.71            0.66            0.72
North Carolina        133             0.65            0.61            053             0.65            057             0.68
North Dakota          120             0.42            0.45            024             0.42            054             0.49
Ohio                  128             0.62            0.68            056             0.67            0.66            0.71
Oklahoma              129             0.47            0.47            0.43            0.49            0.48            056
Pennsylvania          127             059             0.66            054             0.72            0.63            0.69
Rhode Island          125             0.61            0.63            0.48            0.68            0.66            0.60
South Carolina        133             0.64            052             052             0.61            0.70            0.72
Tennessee             131             059             0.63            0.49            0.60            0.44            059
Texas                 128             0.61            0.67            051             059             0.62            0.69
Utah                  130             054             053             0.40            0.63            0.60            0.40
Virginia              128             0.60            055             055             0.64            0.62            0.69
Virgin Islands        98              052             0.64            0.49            0.76            0.70            054
West Virginia         127             059             0.60            0.45            057             057             0.44
Wisconsin             125             0.68            0.63            059             0.48            0.67            051
Wyoming               127             038             0.46            027             0.43            0.49            0.44
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Table 9-12 

Proportion of Proficiency Variance Accounted for by Grade 8 Conditioning Models 



                      Number of       Numbers                                         Data Analysis,  Algebra
                      Principal       and                                             Statistics, and and
State                 Components      Operations      Measurement     Geometry        Probability     Functions       Estimation

Alabama               149             0.64            0.67            056             0.67            mmm             053
Arizona               148             0.62            0.64            056             0.64            .HI             059
Arkansas              148             0.65            0.61            058             056             0.78            0.53
California            150             0.68            0.66            0.63            0.70            031             057
Colorado              149             0.64            0.64            056             0.64            0.79            052
Connecticut           150             0.72            0.72            0.65            0.75            052             0.70
Delaware              129             0.70            0.71            0.63            0.73            054             0.73
District of Columbia  147             057             057             059             0.66            053             0.63
Florida               153             0.67            0.66            0.61            0.70            031             0.65
Georgia               152             0.64            0.65            0.62            0.66            0.80            0.64
Guam                  106             0.77            0.71            059             0.77            055             0.70
Hawaii                140             059             059             0.64            0.70            054             059
Idaho                 139             053             0.61            039             0.47            0.70            0.45
Indiana               150             0.65            0.64            0.64            0.70            031             0.66
Iowa                  142             055             058             053             0.60            0.74            059
Kentucky              145             0.73            0.62            057             0.69            052             0.65
Louisiana             149             0.60            057             0.56            0.64            0.76            054
Maine                 138             0.49            059             0.44            055             0.74            0.60
Maryland              145             0.76            0.74            0.74            0.78            053             0.72
Massachusetts         143             0.63            0.70            "58             0.71            0.78            0.62
Michigan              149             0.64            0.63            0.61            0.63            050             0.65
Minnesota             141             057             056             0.64            0.63            0.79            059
Mississippi           151             0.62            0.61            053             0.78            0.83            055
Missouri              150             0.67            057             052             0.64            0.75
Nebraska              137             058             0.61            056             059             0.77            056
New Hampshire         137             0.65            059             052                             0.74            056
New Jersey            145             0.68            0.72            0.70            0.72            052             059
New Mexico            145             0.66            0.65            052             0.70            0.77            059
New York              144             0.73            0.76            0.70            0.79            056             0.79
North Carolina        154             0.66            0.60            054             0.67            051             0.67
North Dakota          130             0.42            059             0.49            0.48            0.65            0.43
Ohio                  145             0.73            0.66            0.64            0.76            052             0.71
Oklahoma              147             062             0.65            0.60            0.66            0.78            0.62
Pennsylvania          151             0.61            0.63            0.66            0.67            0.79            055
Rhode Island          134             0.69            0.68            056             0.T             0.79            0.66
South Carolina        151             0.70            0.72            0.68            0.73                            058
Tennessee             149             0.64            0.64            056             0.69            0.78            055
Texas                 152             0.68            0.70            0.62            0.68            0.78            059
Utah                  148             056             0.60            0.47            056             0.77            0.60
Virginia              150             0.65            0.62            0.67            0.69            031             0.65
Virgin Islands        106             0.64            056             052             0.61            052             053
West Virginia         147             0.62            0.62            054             059             0.76            055
Wisconsin             142             055             058             0.60            0.64            0.78            0.62
Wyoming               134             0.62            056             0.49            0.62            0.75            056
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Figure 9-12 

Boxplots of Estimated Scale Correlations*, Grade 4
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Figure 9-13 

Boxplots of Estimated Scale Correlations*, Grade 8 
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grades. At the fourth grade, and to a somewhat lesser extent at the eighth grade, the estimated 
correlations between Geometry and the remaining scales are noticeably lower than the 
correlations among the remaining scales and rarely exceed .9. In addition, there appears to be 
somewhat greater variability across jurisdictions in the correlations involving the Estimation 
scales, again with the effect being more clearly pronounced at the fourth grade. Furthermore, 
the correlation between the Geometry and Estimation scales is almost always the lowest among 
the set of correlations. 

As discussed in Chapter 8, NAEP scales are viewed as summaries of consistencies and 
regularities that are present in item-level data. Such summaries should agree with other 
reasonable summaries of the item-level data. In order to evaluate the reasonableness of the 
scaling and estimation results, a variety of analyses were conducted to compare state-level and 
subgroup-level performance in terms of the content area scaled scores and in terms of the
average item score for the set of items in a content area. High agreement was found in all of 
these analyses. One set of such analyses is presented in Figures 9-14 and 9-15. The figures 
contain scatterplots of the state item score means versus the state scale score means for each of
the six mathematics content areas. As is evident from the figures, there is an extremely strong
relationship between the estimates of state-level performance in the scale-score and item-score 
metrics for all six content areas. 



9.6 LINKING STATE AND NATIONAL SCALES 

A major purpose of the Trial State Assessment Program was to allow each participating 
jurisdiction to compare its 1992 results at each grade level with the nation as a whole and with 
the region of the country in which that jurisdiction is located. Because 1992 was the second 
round of the Trial State Assessment, an additional goal was to provide an opportunity to 
compare 1992 results to those obtained in 1990 for those jurisdictions participating in both 
assessments. 

For meaningful comparisons to be made between each of the Trial State Assessment 
participants and the relevant national samples, results from these two assessments had to be 
expressed in terms of a similar system of scale units. In addition, to allow for valid comparisons 
between grades, the systems of scale units for the fourth- and eighth-grade scales needed to be 
aligned and properly calibrated. Furthermore, the scales needed to be comparable to those used 
in 1990 to allow for meaningful assessment of changes in proficiency levels for jurisdictions 
participating in both assessments. 

The fourth-grade and eighth-grade item pools did share a set of common items.

However, as described in the previous section, separate scales were produced for the fourth and 
eighth grades in independent BILOG/PARSCALE calibrations. The units and origin of these 
scales were set by standardizing the within-grade proficiency distributions for their respective 
calibration samples to have a mean of zero and standard deviation of one. Thus, without 
further adjustment, the corresponding grade 4 and grade 8 scales were not expressed in similar 
systems of units. Some form of scale linking or calibration was required. In addition, although 
the fourth and eighth graders in the 1992 Trial State Assessment were administered the same 
test booklets as the fourth and eighth graders in the national assessment, separate state and 
national scalings were carried out (for reasons explained in Mazzeo, 1991, and Yamamoto and 
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Mazzeo, 1992). Again, to ensure a similar scale unit system for the state and national metrics, 
the metrics had to be linked. Plans for the scaling of the 1992 national assessment included
procedures linking the 1992 scales to their 1990 counterparts. These procedures are described
in the forthcoming technical report of the 1992 national assessment. Since the 1990 Trial State
Assessment scales had already been linked to the 1990 national scales, linking the 1992 Trial 
State Assessment scales to their 1992 national counterparts indirectly linked the 1992 Trial State 
Assessment scales to the 1990 Trial State Assessment scales. 

The purpose of this section is to describe the procedures used to align the 1992 Trial 
State scales with their 1992 national counterparts. The procedures that were used represent an 
extension of the common population equating procedures employed to link the 1990 national 
and state scales (Mazzeo, 1991; Yamamoto & Mazzeo, 1992). 

Using the sampling weights provided by Westat, the combined sample of students from 
all participating jurisdictions was used to estimate the distribution of proficiencies for the 
population of students enrolled in public schools in the participating states and the District of
Columbia.⁶ Separate estimates were obtained for grades 4 and 8, with total sample sizes of
108,154 and 105,275, respectively. Data were also used from a subsample of the national
assessment at grades 4 and 8, each consisting of grade-eligible public-school students from any 
of the 44 jurisdictions who participated in the 1992 Trial State Assessment, along with 
appropriate weights provided by Westat, to obtain estimates of the distribution of proficiency for 
the same target population. Again, separate estimates were produced for fourth grade (based 
on 5,198 students) and eighth grade (5,605 students). 

Thus, for each of the 12 scales, two sets of proficiency distributions were obtained. One 
set, based on the sample of combined data from the Trial State Assessment (referred to as the 
Trial State Assessment Aggregate Sample) and using item parameter estimates and conditioning 
results from that assessment, was in the metric of the 1992 Trial State Assessment. The other, 
based on the sample from the 1992 national assessment (referred to as the State Aggregate 
Comparison, or SAC, sample) and obtained using item parameters and conditioning results from 
that assessment, was in the reporting metric of the 1992 national assessment. The latter metric 
had already been linked to the 1990 national reporting metric using the procedures described in 
the forthcoming technical report of the 1992 national assessment. The 12 Trial State 
Assessment and national scales were made comparable by constraining the mean and standard 
deviation of the two sets of estimates to be equal. 

More specifically, the following steps were followed to linearly link the scales of the two 
assessments: 

1) For each scale at each grade, an estimate of the proficiency distribution for the
Trial State Assessment Aggregate Sample was obtained using the full set of
plausible values generated by the MGROUP program. The weights used were
the final sampling weights provided by Westat, not the rescaled versions discussed
in section 9.3. For each grade and each scale, the arithmetic mean of the means of the five
sets of plausible values was taken as the overall estimated mean and the
geometric mean of the standard deviations of the five sets of plausible values was
taken as the overall estimated standard deviation.

⁶ Students from Guam and the Virgin Islands were excluded from the definition of this target population; hence, data
from students from these jurisdictions were not included in the combined Trial State Assessment samples at grade 4 or grade 8.

2) For each scale at each grade, the estimated proficiency distribution of the SAC
sample was obtained, again using the full set of plausible values generated by the
MGROUP program. The weights used were specially provided by Westat to
allow for the estimation of proficiency for the same target population of students
estimated by the state data. The means and standard deviations of the
distributions (in the 1992 national reporting metric) for each scale at each grade
were obtained for this sample in the same manner as described in step 1.

3) For each scale at each grade, a set of linear transformation coefficients was
obtained to link the state scale to the corresponding national scale. The linking
was of the form



$$Y^* = k_1 + k_2 Y$$

where

$Y$ = a scale level in terms of the system of units of the provisional
BILOG/PARSCALE scale of the Trial State Assessment scaling

$Y^*$ = a scale level in terms of the system of units comparable to those
used for reporting the 1992 national mathematics results

$$k_2 = \frac{\text{Standard Deviation}_{SAC}}{\text{Standard Deviation}_{TSA}}$$

$$k_1 = \text{Mean}_{SAC} - k_2 \, \text{Mean}_{TSA}$$

The final conversion parameters for transforming plausible values from the provisional
BILOG/PARSCALE scales to the final Trial State Assessment reporting scales are given in 
Table 9-13. All Trial State Assessment results are reported in terms of the Y* metric. 
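
The three linking steps above can be sketched in a few lines of Python. This is a minimal illustration, not the MGROUP or Westat implementation: the function names (`weighted_mean_sd`, `pooled_mean_sd`, `linking_constants`, `to_reporting_scale`) and the toy inputs are hypothetical, and the weighted standard deviation uses a simple weight-normalized formula that the report does not specify.

```python
import math


def weighted_mean_sd(values, weights):
    """Weighted mean and standard deviation of one set of plausible values."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


def pooled_mean_sd(plausible_value_sets, weights):
    """Steps 1 and 2: arithmetic mean of the per-set means and
    geometric mean of the per-set standard deviations."""
    stats = [weighted_mean_sd(pv, weights) for pv in plausible_value_sets]
    means = [m for m, _ in stats]
    sds = [s for _, s in stats]
    mean = sum(means) / len(means)
    sd = math.exp(sum(math.log(s) for s in sds) / len(sds))
    return mean, sd


def linking_constants(tsa_mean, tsa_sd, sac_mean, sac_sd):
    """Step 3: k2 is the ratio of standard deviations; k1 aligns the means."""
    k2 = sac_sd / tsa_sd
    k1 = sac_mean - k2 * tsa_mean
    return k1, k2


def to_reporting_scale(y, k1, k2):
    """Apply Y* = k1 + k2 * Y to a provisional-scale value."""
    return k1 + k2 * y


# Toy (made-up) moments on the provisional and national metrics.
k1, k2 = linking_constants(tsa_mean=0.02, tsa_sd=0.98, sac_mean=216.0, sac_sd=34.3)
linked_mean = to_reporting_scale(0.02, k1, k2)  # reproduces the SAC mean
linked_sd = k2 * 0.98                           # reproduces the SAC standard deviation
```

By construction, the transformed Trial State Assessment aggregate has the same mean and standard deviation as the SAC sample. Applied with the Grade 4 Numbers and Operations constants reported in Table 9-13 (215.53 and 34.78), `to_reporting_scale(0.5, 215.53, 34.78)` places a provisional value of 0.5 at about 232.92 on the reporting scale.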

It is important to re-emphasize two features of the linking procedures just described. 
First, the 1992 national scales had already been linked to their 1990 counterparts. Hence, the 
linking just described places the 1992 state scales on a metric comparable to that used for the 
1990 national scales. Since the 1990 state metric was also made comparable to those same 
national scales, the 1992 and 1990 state results are in comparable metrics. Second, the 1990 
national scales for each content area and for estimation were across-grade scales spanning 
grades 4, 8, and 12. Each had been produced by concurrently scaling the items from all three 
grade levels in a single BILOG calibration (see Yamamoto & Jenkins, 1992). For each content 
area, and for estimation, the grade 4 and grade 8 1992 state scales have been calibrated to the 
same 1990 across-grade scale. Hence, the grade 4 and grade 8 Trial State Assessment results 
are also on comparable scales.

Table 9-13

Transformation Constants for the Grade 4 and Grade 8 Scales

                                                   Grade 4             Grade 8
Scale                                             k1        k2        k1        k2
Numbers and Operations                        215.53     34.78    268.25     34.55
Measurement                                   220.60     33.95    262.30     43.54
Geometry                                      219.95     29.16    260.16     34.30
Data Analysis, Statistics, and Probability    217.87     30.96    264.48     40.32
Algebra and Functions                         217.00     29.66    263.61     36.61
Estimation                                    205.31     35.79    267.11     28.35

As is evident from the discussion above, a linear method was used to link the scales from
the Trial State and national assessments. While this linear method ensures equality of means
and standard deviations for the Trial State Assessment aggregate (after transformation) and the
SAC samples, it does not guarantee that the shapes of the estimated proficiency distributions for the
two samples are the same. As these two samples are both from a common target population,
estimates of the proficiency distribution of that target population based on each of the samples
should be quite similar in shape in order to justify strong claims of comparability for the Trial
State and national scales. Substantial differences in the shapes of the two estimated
distributions would result in differing estimates of the percentages of students above
achievement levels or of percentile locations depending on whether Trial State or national scales
were used, a clearly unacceptable result given claims about comparability of scales. In the face
of such results, nonlinear linking methods would be required.
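
A small simulation makes the concern concrete. The sketch below is illustrative only: the two synthetic samples (one normal, one negatively skewed), the target mean of 215 and standard deviation of 35, and the cut point of 250 are all assumptions chosen for the example, not values from the assessment. Both samples are linearly rescaled to identical means and standard deviations, yet the percentage above the cut point differs.

```python
import math
import random


def rescale(values, target_mean, target_sd):
    """Linearly transform values so they have exactly the target mean and SD."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [target_mean + target_sd * (v - mean) / sd for v in values]


def percent_above(values, cut):
    """Percentage of values strictly above the cut point."""
    return 100.0 * sum(v > cut for v in values) / len(values)


rng = random.Random(1992)  # fixed seed so the sketch is reproducible
n = 50_000
symmetric = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Negating a gamma variate gives a distribution with a longer lower tail.
skewed = [-rng.gammavariate(4.0, 1.0) for _ in range(n)]

sample_a = rescale(symmetric, 215.0, 35.0)
sample_b = rescale(skewed, 215.0, 35.0)

pct_a = percent_above(sample_a, 250.0)
pct_b = percent_above(sample_b, 250.0)
```

Under these assumptions the two percentages are expected to differ by roughly one and a half points even though the first two moments agree exactly, which is the kind of discrepancy that would call for a nonlinear linking method.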

Analyses were carried out (one set of analyses for grade 4 and one set for grade 8) to
verify the degree to which the linear linking process described above produced comparable
scales for Trial State and national results. Comparisons were made between two estimated
proficiency distributions, one based on the Trial State Assessment aggregate and one based on
the SAC sample, for each of the six mathematics scales. The comparisons were carried out
using slightly modified versions of what Wainer (1974) refers to as suspended rootograms. The
final reporting scales for the Trial State and national assessments were each divided into
10-point intervals. Two sets of estimates of the percentage of students in each interval were
obtained, one based on the Trial State Assessment aggregate sample and one based on the SAC
sample. Following Tukey (1971), the square roots of these estimated percentages were
compared.⁷
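
The root-percent comparison can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up scores; clamping out-of-range scores into the end intervals is an assumption of the sketch, and the sign of the difference (Trial State Assessment aggregate minus SAC) is chosen to agree with the description of the figures in this section.

```python
import math


def interval_percents(scores, weights, lo, hi, width=10):
    """Weighted percentage of students in each `width`-point interval of [lo, hi)."""
    n_bins = int((hi - lo) // width)
    totals = [0.0] * n_bins
    for score, weight in zip(scores, weights):
        index = int((score - lo) // width)
        index = min(max(index, 0), n_bins - 1)  # assumption: clamp to end intervals
        totals[index] += weight
    grand_total = sum(totals)
    return [100.0 * t / grand_total for t in totals]


def root_percent_differences(tsa_percents, sac_percents):
    """Square-root each percentage, then take TSA minus SAC per interval."""
    return [math.sqrt(t) - math.sqrt(s) for t, s in zip(tsa_percents, sac_percents)]


# Made-up scores on a 100-130 scale with equal weights.
tsa = interval_percents([105, 105, 115, 125], [1, 1, 1, 1], lo=100, hi=130)
sac = interval_percents([105, 115, 115, 125], [1, 1, 1, 1], lo=100, hi=130)
differences = root_percent_differences(tsa, sac)
```

In this toy case the first interval has a positive difference (the Trial State Assessment aggregate places more students there than the SAC sample), the second has a negative difference of the same size, and the third is zero.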

The comparisons are shown in Figures 9-16 through 9-21. The height of each of the
unshaded bars corresponds to the square root of the percentage of students from the Trial State
Assessment aggregate sample in each 10-point interval on the final reporting scale. The shaded
bars show the differences in root percents between the Trial State Assessment and SAC
estimates.⁸ Positive differences indicate intervals in which the estimated percentages from the
State Aggregate Comparison sample are lower than those obtained from the Trial State
Assessment aggregate. Conversely, negative differences indicate intervals in which the estimated
percentages from the State Aggregate Comparison sample are higher. For all six scales at both
grades, differences in root percents are quite small, suggesting that the shapes of the two
estimated distributions are quite similar (i.e., unimodal with slight negative skewness). There is
some evidence that the estimates produced using the Trial State Assessment data are slightly
heavier in the extreme lower tails (below 100 for the grade 4 scales and below 150 for the grade
8 scales). However, even these differences at the extremes are small in magnitude (.2 in the
root percent metric, .04 in the percent metric) and have little impact on estimates of reported
statistics such as percentages of students below the achievement levels.

⁷ The square root transformation allows for more effective comparisons for counts (or equivalently, percentages) when
the expected number of counts in each interval is likely to vary greatly over the range of intervals, as is the case for the
NAEP scales where the expected counts of individuals in intervals near the extremes of the scale (e.g., below 150 and above
350) are dramatically smaller than the counts obtained near the middle of the scale.

⁸ Wainer (1974), among others, has suggested that looking at residuals around a fitted straight line makes judgments of
differences somewhat easier to make. Hence, the differences between the root percents (rather than separate sets of root
percents) from the SAC sample and the Trial State Assessment aggregate are plotted around the x-axis in Figures 9-16
through 9-21.

Figure 9-16

Rootogram Comparing Proficiency Distributions
for the Trial State Assessment Aggregate Sample
and the State Aggregate Comparison Sample from the National Assessment
for the Numbers and Operations Scale

Figure 9-17

Rootogram Comparing Proficiency Distributions
for the Trial State Assessment Aggregate Sample
and the State Aggregate Comparison Sample from the National Assessment
for the Measurement Scale

Figure 9-18

Rootogram Comparing Proficiency Distributions
for the Trial State Assessment Aggregate Sample
and the State Aggregate Comparison Sample from the National Assessment
for the Geometry Scale
[Panel: Geometry - Grade 4]

Figure 9-19

Rootogram Comparing Proficiency Distributions
for the Trial State Assessment Aggregate Sample
and the State Aggregate Comparison Sample from the National Assessment
for the Data Analysis, Statistics, and Probability Scale

Figure 9-20

Rootogram Comparing Proficiency Distributions
for the Trial State Assessment Aggregate Sample
and the State Aggregate Comparison Sample from the National Assessment
for the Algebra and Functions Scale
[Panels: Algebra - Grade 4; Algebra - Grade 8]

Figure 9-21

Rootogram Comparing Proficiency Distributions
for the Trial State Assessment Aggregate Sample
and the State Aggregate Comparison Sample from the National Assessment
for the Estimation Scale
[Panel: Estimation - Grade 4]



9.7 PRODUCING A MATHEMATICS COMPOSITE SCALE 

For the national assessment, composite scales were created for both fourth and eighth
grade as overall measures of mathematics proficiency for students at that grade. The composite
was a weighted average of plausible values on the five content area scales (Numbers and
Operations; Measurement; Geometry; Data Analysis, Statistics, and Probability; and Algebra
and Functions). The weights for the national content area scales were proportional to the
relative importance assigned to each content area for each grade in the assessment
specifications developed by the Mathematics Objectives Panel. Consequently, the weights for
each of the content areas are similar to the actual proportion of items from that content area at
each grade.

Trial State Assessment composite scales were developed using weights identical to those
used to produce the composites for the 1992 national mathematics assessment. The weights are
given in Table 9-14. In developing the Trial State Assessment composite for each grade, the
weights were applied to the plausible values for each content area scale as expressed in terms of
the final Trial State Assessment scales for each grade (i.e., after transformation from the
provisional BILOG/PARSCALE scales).
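
As a concrete illustration, the composite for a single set of plausible values can be computed as below. The grade 4 weights are taken from Table 9-14 as we read it (with decimal points restored); the function name `composite` and the student's scale values are hypothetical, and applying the weights one plausible-value draw at a time is an assumption about the procedure.

```python
# Grade 4 composite weights as read from Table 9-14 (decimal points restored).
GRADE4_WEIGHTS = {
    "Numbers and Operations": 0.45,
    "Measurement": 0.20,
    "Geometry": 0.15,
    "Data Analysis, Statistics, and Probability": 0.10,
    "Algebra and Functions": 0.10,
}


def composite(scale_values, weights):
    """Weighted average of content-area plausible values on the final scales."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("composite weights must sum to 1")
    return sum(weights[scale] * scale_values[scale] for scale in weights)


# Made-up plausible values (one draw) for a single grade 4 student.
student = {
    "Numbers and Operations": 220.0,
    "Measurement": 210.0,
    "Geometry": 230.0,
    "Data Analysis, Statistics, and Probability": 200.0,
    "Algebra and Functions": 215.0,
}
composite_value = composite(student, GRADE4_WEIGHTS)
```

For these made-up values the composite is 99 + 42 + 34.5 + 20 + 21.5 = 217. Estimation does not enter the composite, since the text lists only the five content area scales.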

Figure 9-22 provides rootograms comparing the estimated proficiency distributions based 
on the Trial State Assessment and SAC samples for the grade 4 and grade 8 composites. 
Consistent with the results presented separately by scale, there is some evidence that the
estimates produced using the Trial State Assessment data are slightly heavier in the extreme
lower tails than the corresponding estimates based on the SAC data. Again, however, these
differences in root relative percents are small in magnitude.




Table 9-14

Weights Used for Each Scale to Form Grade 4 and Grade 8 Composites

Scale                                         Grade 4   Grade 8
Numbers and Operations                          .45       .30
Measurement                                     .20       .15
Geometry                                        .15       .20
Data Analysis, Statistics, and Probability      .10       .15
Algebra and Functions                           .10       .20

Figure 9-22

Rootogram Comparing Proficiency Distributions
for the Trial State Assessment Aggregate Sample
and the State Aggregate Comparison Sample from the National Assessment
for the Composite Scale